Category Archives: iAD

Interesting Wolfram Alpha statistics

Here is the answer you get from entering "budget surplus" into Wolfram Alpha:

[image: Wolfram Alpha result for "budget surplus"]

Two things I did not know: First, the fifth largest government surplus in the world is held by Serbia, which surprises me, given that the country has 14% unemployment and a recovering economy, according to Wikipedia. Second, Japan’s deficit is very close to that of the US, indicating that things are not as bad in the US as you might think. Or perhaps it indicates that the numbers are a bit dated, though according to the source information, most of them are from 2009.

Since May 17th is Norway’s national day, I think it behooves me to point out that of the five surplus states listed above, Norway is the nicest place to live, by most measures (weather, culture, politics, human rights, health care, etc. etc.). On the other hand, many of the countries with large deficits are nice places to live, so I wouldn’t read too much into the economics at all…

(Hat tip to Karthik, who retweeted one of my tweets, which I misunderstood and started researching….)

Notes from Stephen Wolfram webcast

These are my raw notes from the session with Stephen Wolfram on the pre-launch of the Wolfram Alpha service at the Berkman center. Unfortunately, I was on a really bad Internet connection and only got the sound, and missed the first 20 minutes or so running around trying to find something better.

Notes from Stephen Wolfram on Alpha debut

…discussion of queries:
– nutrition in a slice of cheddar
– height of Mount Everest divided by length of Golden Gate bridge
– what’s the next item in this sequence
– type in a random number, see what it knows about it
– "next total solar eclipse"

What is the technology?
– computes things, it is harder to find answers on the web the more specifically you ask
– instead, we try to compute using all kinds of formulas and models created from science and package it so that we can walk up to a web site and have it provide the answer

– four pieces of technology:
— data curation, trillions of pieces of curated data, free/licensed, feeds, verify and clean this (curate), built industrial data curation line, much of it requires human domain expertise, but you need curated data
— algorithms: methods and models, expressed in Mathematica, there is a finite number of methods and models, but it is a large number…. now 5-6 million lines of math code
— linguistic analysis to understand input, no manual or documentation, have to interpret natural language. This is a little bit different from traditional NL processing: working with a more limited set of symbols and words. Many new methods; it has turned out that ambiguity is not such a big problem once we have mapped it onto a symbolic representation
— ability to automate presentation of things. What do you show people so they can cognitively grasp what you are, requires computational esthetics, domain knowledge.

Will run on 10k CPUs, using Grid Mathematica.
90% of the shelves in a typical reference library we have a decent start on
provide something authoritative and then give references to something upstream that is
know about ranges of values for things, can deal with that
try to give footnotes as best we can

Q: how do you deal with keeping data current
– many people have data and want to make it available
– mechanism to contribute data and mechanism for us to audit it

first instance is for humans to interact with it
there will be a variety of APIs
intention to have a personalizable version of Alpha
metadata standards: when we open up our data repository mechanism, wn we use that can make data available

Questions from audience:

Differences of opinion in science?
– we try to give a footnote
– Most people are not exposed to science and engineering, you can do this without being a scientist

How much will you charge for this?
– website will be free
– corporate sponsors will be there as well, in sidebars
– we will know what kind of questions people ask, how can we ingest vendor information and make it available, need a wall of auditing
– professional version, subscription service

Can you combine databases, for instance to compute total mass of people in England?
– probably not automatically…
– can derive it
– "mass of people in England"
– we are working on the splat page, what happens when it doesn’t know, tries to break the query down into manageable parts
300th largest country in Europe? – answers "no known countries"

Data sources? Population of Internet users. how do you choose?
– identifying good sources is a key problem
– we try to do it well, use experts, compare
– US government typically does a really good job
– we provide source information
– have personally been on the phone with many experts, is the data knowable?
– "based on available mortality data" or something

Technology focus in the future, aside from data curation?
– all of them need to be pushed forward
– more, better, faster of what we have, deeper into the data
– being able to deal with longer and more complicated linguistics
– being able to take pseudocode
– being able to take raw data or image input
– it takes me 5-10 years to understand what the next step is in a project…

How do you see this in contrast with semantic web?
– if the semantic web had been there, this would be much easier
– most of our data is not from the web, but from databases
– within Wolfram Alpha we have a symbolic ontology, didn’t create this as top down, mostly bottom-up from domains, merged them together when we realized similarities
– would like to do some semantic web things, expose our ontological mechanisms

At what point can we look at the formal specs for these ontologies?
– good news: All in symbolic mathematical code
– injecting new knowledge is complicated – nl is surprisingly messy, such as new terms coming in, for instance putting in people and there is this guy called "50 cent"
– exposure of ontology will happen
– the more words you need to describe the question, the harder it is
– there are holes in the data, hope that people will be motivated to fill them in

Social network? Communities?
– interesting, don’t know yet

How about more popular knowledge?
– who is the tallest of Britney Spears and 50 cent
– popular knowledge is more shallowly computable than scientific information
– linguistic horrors, book names and such, much of it clashes
– will need some popularity index, use Wikipedia a lot, can determine whether a person is important or not

The meaning of life? 42….

Integration with CYC?
– CYC is most advanced common sense reasoning system
– CYC takes what they reason about things and make it computing strengths
– human reasoning not that good when it comes to physics, more like Newton and using math

Will you provide the code?
– in Mathematica, code tends to be succinct enough that you can read it
– state of the art of synthesizing human-readable theorems is not that good yet
– humans are less efficient than automated and quantitative qa methods
– in many cases you can just ask it for the formula
– our pride lies in the integration, not in the models, for they come from the world
– "show formula"

Will this be integrated into Mathematica?
– future version will have a special mode, linguistic analysis, pop it to the server, can use the computation

How much more work on the natural language side?
– we don’t know
– pretty good at removing linguistic fluff, have to be careful
– when you look at people interacting with the system, but pretty soon they get lazy, only type in the things they need to know
– word order irrelevant, queries get pared down, we see deep structure of language
– but we don’t know how much further we need to go

How does this change the landscape of public access to knowledge?
– proprietary databases: challenge is make the right kind of deal
– we have been pretty successful
– we can convince them to make it casually available, but we would have to be careful that the whole thing can’t be lifted out
– we have yet to learn all the issues here

– have been pleasantly surprised by the extent to which people have given access
– there is a lot of genuinely good public data out there

This is a proprietary system – how do you feel about a wiki solution outcompeting you?
– that would be great, but
– making this thing is not easy, many parts, not just shovel in a lot of data
– Wikipedia is fantastic, but it has gone in particular directions. If you are looking for systematic data, properties of chemicals, for instance, over the course of the next two years, they get modified and there is no consistency left
– the most useful thing about Wikipedia is the folk knowledge you get there, what are things called, what is popular
– have thought about how to franchise out, it is not that easy
– by the way, it is free anyway…
– will we be inundated by new data? Encouraged by good automated curation pipelines. I like to believe that an ecosystem will develop, we can scale up.
– if you want this to work well, you can’t have 10K people feeding things in, you need central leadership

Interesting queries?
– "map of the cat" (this is what I call artificial stupidity)
– does not know anatomy yet
– how realtime is stock data? One minute delayed, some limitations
– there will be many novelty queries, but after that dies down, we are left with people who will want to use this every day

How will you feel if Google presents your results as part of their results?
– there are synergies
– we are generating things on the fly, this is not exposable to search engines
– one way to do it could be to prescan the search stream and see if wolfram alpha can have a chance to answer this

Role for academia?
– academia no longer accumulates data, useful for the world, but not for the university
– it is a shame that this has been seen as less academically respectable
– when chemistry was young, people went out and looked at every possible molecule
– this is much too computationally complicated for the typical libraries
– historical antecedents may be Leibniz’ mechanical and computational calculators, he had the idea, but 300 years too early

When do we go live?
… a few weeks
– maybe a webcast if we dare…

Jon Udell on observable work

Jon Udell has a great presentation over at Slideshare on how to work in observable spaces – something that should be done, to a much larger extent, by academics. I quite agree (and really need to get better at this myself):

What if you could remember everything?

I was delighted when I found this video, where James May (the cerebral third of Top Gear) talks to professor Alan Smeaton of Dublin City University about lifelogging – the recording of everything that happens to a person over a period of time, coupled with the construction of tools for making sense of the data.

In this example, James May wears a Sensecam for three days. The camera records everything he does (well, not everything, I assume – if you want privacy, you can always stick it inside your sweater) by taking a picture every 30 seconds, or when something (temperature, IR rays in front (indicating a person) or GPS location) changes. As it is said in the video, some people have been wearing these cameras for years – in fact, one of my pals from the iAD project, Cathal Gurrin, has worn one for at least three years. (He wore it the first time we met, where it snapped a picture of me with my hand outstretched.)

The software demonstrated in the video groups the pictures into events, by comparing the pictures to each other. Of course, many of the pictures can be discarded in the interest of brevity – for instance, for anyone working in an office and driving to work, many of the pictures will be of two hands on a keyboard or a steering wheel, and can be discarded. But the rest remains, and with powerful computers you can spin through your day and see what you did on a certain date.
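As a rough illustration of the grouping idea (not the actual Sensecam software, whose method the video does not spell out), here is a minimal Python sketch. It assumes each picture has already been reduced to a small feature vector, such as a coarse colour histogram. The function name `group_into_events`, the threshold and the toy data are all hypothetical.

```python
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_into_events(features, threshold=0.5):
    """Split a time-ordered list of feature vectors into events.

    A new event starts whenever a picture differs from the previous one
    by more than `threshold` (an assumed, hand-tuned value).
    Returns a list of events, each a list of picture indices.
    """
    events = []
    for i, vec in enumerate(features):
        if i == 0 or distance(features[i - 1], vec) > threshold:
            events.append([i])      # scene changed: open a new event
        else:
            events[-1].append(i)    # similar to the previous picture: same event
    return events


# Toy data: three "keyboard" pictures followed by two "steering wheel" pictures.
pictures = [(0.9, 0.1), (0.88, 0.12), (0.91, 0.09), (0.2, 0.8), (0.22, 0.79)]
events = group_into_events(pictures)
```

Once the pictures are grouped like this, discarding the dull ones amounts to keeping a single representative picture per event.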

And here is the thing: This means that you will increasingly have the option of never forgetting anything again. You know how it is – you may have forgotten everything about some event, and then something – a smell, a movement, a particular color – makes you remember by triggering whatever part of your brain (or, more precisely, whichever strands of your intracranial network) this particular memory is stored in. Memory is associative, meaning that if we have a few clues, we can access whatever is in there, even though it had been forgotten.

Now, a set of pictures taken at 30-second intervals, tied together in an easy-to-use interface: that is a rather powerful aide-mémoire.

Forgetting, however, is done for a purpose – to allow you to concentrate on what you are doing rather than using spare brain cycles in constant upkeep of enormous, but unimportant memories. For this system to be effective, I assume it would need to be helpful in forgetting as well as remembering – and since everything would be stored, you would actually not have to expend so much effort remembering things. Given a decent interface, you could always look it up again, much as we look things up in a notebook.

Think about that – remembering everything – or, at least being able to recall it at will. Useful – or an unnecessary distraction?

Search and effectiveness in creativity

Effective creativity is often accomplished by copying, by the creation of certain templates that work well, which are then changed according to need and context. Digital technology makes copying trivial, and search technology makes finding usable templates easy. So how do we judge creativity when combinations and associations can be done semi-automatically?

One of my favorite quotes is supposedly by Fyodor Dostoyevsky: "There are only two books written: Someone goes on a journey, or a stranger comes to town." Thinking about it, it is surprisingly easy to divide the books you have read into one or the other. The interesting part, however, lies not in the copying, but in the abstraction: The creation of new categories, archetypes, models and templates from recognizing new dimensions of similarity in previously seemingly unrelated instances of creative work.

Here is a video, fresh from YouTube, demonstrating how Disney reuses character movements, especially in dance scenes:

Of course, anyone who has seen Fantasia recognizes that there are similarities between Disney movies, even schools: the "angular" ones represented by 101 Dalmatians, Sleeping Beauty and Mulan, and the more rounded, cutesy ones represented by Bambi, The Jungle Book and Robin Hood. (Tom Wolfe referred to this difference as Apollonian versus Dionysian (he was talking about car design, but what the heck), and apparently borrowed that distinction from Nietzsche. But I digress.)

This video, I suspect, was created by someone recognizing movements, and putting the demonstration together manually. But in the future, search and other information access technologies will allow us to find such dimensions simply by automatically exploring similarities in the digital representations of creative works – computers finding patterns where we do not.

One example (albeit aided by human categorization) of this is the Pandora music service, where the user enters a song or an artist, and Pandora finds music that sounds similar to the song or artist entered. This can produce interesting effects: I found, for instance, that there is a lot of similarity (at least Pandora seems to think so, and I agree, though I hadn’t seen it myself) between U2 and Pink Floyd. And imagine my surprise when, on my U2 channel (where the seed song was Still haven’t found what I’m looking for), a song by Julio Iglesias popped up. Normally I wouldn’t be caught dead listening to Julio Iglesias, but apparently this one song was sufficiently similar in its musical makeup to make it into the U2 channel. (I don’t remember the name of the song now, but remember that I liked it.)

In other words, digital technology enables us to discover categorization schemes and visualize them. Categorization is power, because it shapes how we think about and find information. In business terms, new ways to categorize information can mean new business models or at least disruptions of the old. Pandora has interesting implications for artist brand equity, for instance: If I wanted to find music that sounded like U2 before, my best shot would be to buy a U2 record. Now I can listen to my U2 channel on Pandora and get music from many musicians, most of whom are totally unknown to me, found based on technical comparisons of specific attributes of their music (effectively, a form of factor analysis) rather than the source of the creativity.
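To make "technical comparisons of specific attributes" a little more concrete, here is a small Python sketch. It is not Pandora's actual method (which I do not know in detail), and it is plain cosine similarity rather than factor analysis. It assumes each song has been scored by hand on a few attributes; the attribute names, the scores and the function `most_similar` are invented for illustration.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two attribute vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def most_similar(seed, catalogue, top_n=2):
    """Return the names of the `top_n` songs most similar to `seed`.

    `catalogue` maps a song name to its attribute vector.
    """
    ranked = sorted(
        catalogue,
        key=lambda name: cosine_similarity(seed, catalogue[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Made-up attribute scores: (guitar-driven, tempo, vocal intensity)
seed_song = (0.8, 0.6, 0.7)
catalogue = {
    "song_a": (0.7, 0.5, 0.8),
    "song_b": (0.1, 0.9, 0.2),
    "song_c": (0.9, 0.6, 0.6),
}
playlist = most_similar(seed_song, catalogue)
```

Note that nothing in the comparison knows who the artist is, which is exactly how a Julio Iglesias song can end up on a U2 channel.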

I am not sure how this will work for artists in general. On one hand, there is the argument that in order to make it in the digital world, you must be more predictable, findable, and (like newspaper headlines) not too ironic. On the other hand, if you create something new – a nugget of creativity, rather than a stream – this single instance will achieve wider distribution than before, especially if it is complex and hard to categorize (or, at least, rich in elements that can be categorized but inconclusive in itself).

Susan Boyle, the instant surprise on the Britain’s Got Talent show, is now past 20 million views on YouTube and is just that – an instant, rich and interesting nugget of information (and considerable enjoyment) which more or less explodes across the world. She’ll do just fine in this world, thank you very much. Search technology or not…

Google edging closer to being "the new Microsoft"

A few years ago, I wrote an essay about how Microsoft had become the new IBM – i.e., the dominant, love-to-hate company of the computer industry. In this interesting article, John Lanchester discusses how Google now is stepping into that role, with its aggressive moves into making the world searchable, and a lot more than you would like findable. Interesting point:

[…] as Google makes clear, nothing short of a court order is going to stop it digitising every book in print. Google doesn’t accept that that constitutes a violation of copyright. But the company won’t even discuss the physical process by which it scans the books: a classic example of how very free it is with other people’s intellectual property, while being highly protective of its own.

This issue, in all its various forms, isn’t going to go away. Book Search, Street View and many of Google’s other offerings simply bulldoze existing ideas of how things are and how they should be done. I was highly critical of Gmail when it first came in, on the grounds that the superbly effective mail system came at the unacceptable price of allowing Google to scan all emails and place text ads. But I soon began using it, because it was free, and because it’s such good software, and because I frankly never noticed the ads.

He goes on to show how a hard disk crash and a botched backup restore left him without his documents, until it dawned on him that, yes, Gmail had them all, ready for download. So big brothers can be nice, but they are still Big Brothers…

Good survey of web business models

Box UK has a very complete survey of web business models. My concept of business models on the web (for information content and services, not just services) had four categories: free, ad-supported, subscription or some form of micropayment (Skype, I would say, is the largest user of this concept, though they draw payments from a regularly replenished account).

This site expands that classification a bit, including things like taking payment for physical products and letting the service follow the product (which more and more electronics manufacturers do). What strikes me is that there really is nothing new here – all the business models have been around since time immemorial, something the media industries of the world should take note of.

The answer for more and more of the granulated information providers of the world lies in either moving towards the user (becoming, essentially, an integrated part of the user’s information need, such as Bloglines) or stepping further away from the individual information consumer, becoming a source of content for others to fight over. Paying a license, of course.

Anyway, a good list, and a good survey. Check it out.

(Via Chris Anderson of Wired.)

Shirky on newspapers

Clay Shirky, the foremost essayist on the Internet and its boisterous intrusion into everything, has done it again: Written an essay on something already thoroughly discussed with a new and fresh perspective. This time, it is on the demise of newspapers – the short message is that this is a revolution, and saving newspapers just isn’t going to happen, because this is, well, a revolution:

[..]I remember Thompson [in 1993] saying something to the effect of “When a 14 year old kid can blow up your business in his spare time, not because he hates you but because he loves you, then you got a problem.” I think about that conversation a lot these days.

[..]

Revolutions create a curious inversion of perception. In ordinary times, people who do no more than describe the world around them are seen as pragmatists, while those who imagine fabulous alternative futures are viewed as radicals. The last couple of decades haven’t been ordinary, however. Inside the papers, the pragmatists were the ones simply looking out the window and noticing that the real world was increasingly resembling the unthinkable scenario. These people were treated as if they were barking mad. Meanwhile the people spinning visions of popular walled gardens and enthusiastic micropayment adoption, visions unsupported by reality, were regarded not as charlatans but saviors.

[..]

That is what real revolutions are like. The old stuff gets broken faster than the new stuff is put in its place. The importance of any given experiment isn’t apparent at the moment it appears; big changes stall, small changes spread. Even the revolutionaries can’t predict what will happen. Agreements on all sides that core institutions must be protected are rendered meaningless by the very people doing the agreeing. (Luther and the Church both insisted, for years, that whatever else happened, no one was talking about a schism.) Ancient social bargains, once disrupted, can neither be mended nor quickly replaced, since any such bargain takes decades to solidify.

And so it is today. When someone demands to know how we are going to replace newspapers, they are really demanding to be told that we are not living through a revolution. They are demanding to be told that old systems won’t break before new systems are in place. They are demanding to be told that ancient social bargains aren’t in peril, that core institutions will be spared, that new methods of spreading information will improve previous practice rather than upending it. They are demanding to be lied to.

That simple. He draws the line back to the Gutenberg printing press and the enormous transition that caused – much more chaotic than you would think with 500 years of hindsight.

Highly recommended. And another piece of reading for my suffering students….

Wolfram is at it again

Stephen Wolfram’s next project, the Wolfram|Alpha search "engine" (or, rather, answer to everything that is computable) is due out in May (visit it here). To me it seems like a combination of Google, CYC and, perhaps, Mathematica. It certainly is interesting and should do much for factual search, not to mention conversational interfaces to search. Nova Spivack thinks it is as important as Google. Doug Lenat (in the comment field to Spivack’s blog post) says

[…] it’s not AI, and not aiming to be, so it shouldn’t be measured by contrasting it with HAL or Cyc but with Google or Yahoo. At its heart is a formal Mathematica representation. Its inference engine is basically a large number of individually hand-engineered scripts for tapping into data which he and his team have spent the last several years gathering and "curating". For example, he has assembled tables of historical financial information about countries’ GDP’s and about companies’ stock prices. In a small number of cases, he also connects via API to third party information, but mostly for realtime data such as a current stock price or current temperature. Rather than connecting to and relying on the current or future Semantic Web, Alpha computes its answers primarily from his own curated data to the extent possible; he sees Alpha as the home for almost all the information it needs, and will use to answer users’ queries.

Another way of seeing it might be as the latest shot at providing answers by processing rather than storage – which fits nicely with Wolfram’s idea of computational equivalence – that the universe can be described by a simple set of rules, which as far as I understand it means that all complexity is only apparent, not real, and only so because we have not yet understood the underlying algorithms.

I just can’t wait to try it out – and to see what the impact will be on more storage-intensive search engines and their use.

Update March 12: This is garnering some serious attention for a service that isn’t even in beta yet…

Interesting search: Oodle.com

Oodle.com is a federated search engine for classified ads – it does not (at least as far as I know) have its own ads, but acts as a portal to other ad sites, presumably in return for a share of profits.

The value created comes partly from the interface technology (enter "Mercedes 450 SEL 6.9" and it knows you are looking for a car and formats the page so that you can drill down on models and years) and partly from the fact that it accesses all kinds of local and community-based listings.

When I was looking for my used Mercedes I searched large advertisers such as cars.com and autotrader.com – but they only show their own ads. Oodle.com would have found me more cars (though not any I would have bought rather than the car I did get.) Useful because many markets are local and therefore hidden if you come from outside.
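For readers unfamiliar with the term, federated search can be sketched in a few lines of Python. This is an illustration only and says nothing about how Oodle is actually built: the two "sites" below are stand-in functions with canned, made-up listings, and `federated_search` is a hypothetical name.

```python
def federated_search(query, sources):
    """Send `query` to every source and merge the answers.

    `sources` maps a site name to a function that takes a query string and
    returns a list of listings (dicts with 'title' and 'price').
    Duplicates (same title and price) are dropped; results are sorted by price.
    """
    merged = []
    seen = set()
    for site, search in sources.items():
        for listing in search(query):
            key = (listing["title"].lower(), listing["price"])
            if key in seen:
                continue                      # same ad already found elsewhere
            seen.add(key)
            merged.append({**listing, "source": site})
    return sorted(merged, key=lambda item: item["price"])


def _filter(ads, query):
    """Naive substring match, standing in for each site's own search."""
    return [ad for ad in ads if query.lower() in ad["title"].lower()]


def local_paper(query):
    return _filter([{"title": "Mercedes 450 SEL 6.9", "price": 9500}], query)


def community_board(query):
    ads = [
        {"title": "Mercedes 450 SEL 6.9", "price": 8700},
        {"title": "mercedes 450 sel 6.9", "price": 9500},  # cross-posted ad
        {"title": "Volvo 240", "price": 2000},
    ]
    return _filter(ads, query)


results = federated_search(
    "450 SEL",
    {"local_paper": local_paper, "community_board": community_board},
)
```

The portal holds no ads of its own; all of its value lies in the merging, deduplication and presentation.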

Interesting search

Since I am doing research on search, I thought I would create a list of interesting search-based web sites here, with individual blog entries describing each site and why they are interesting. Here is a starting list, which, of course, will be added to as I discover more interesting sites.

Interfaces:

  • Searchme.com – visual search interface reminiscent of iPod Touch album covers (or, rather, the other way around)
  • New York Times – search-based editorial pages (topic pages) (conversational interface)
  • Times of London – search-based editorial pages (topic pages) defined by user (conversational interface)
  • Yahoo Mindset – intent-driven (or rather, intent-revealing) interface for product search. This is no longer available, but this blogpost has an explanation and a graphic of the "intent slider".

Federated search

  • Oodle – federated search for classified ads
  • Globrix – federated search for real estate in UK

Rich media search

  • SnapTell – instant product identification from mobile photo
  • TinEye – image-matching search (great service, but unfortunately the index is rather small)
  • Shazam – music-matching search for mobile phones (not quite query by humming, but close…) See article in CACM.

Regionals

  • Indian search engines: asklaila.com (local search)
  • Chinese search engines: Baidu (a serious competitor to Google)
  • Sesam.no search engine: Specializing in Norwegian content not easily available on Google, such as relationships between people.

Specialists

  • OpenCalais – metadata generator, useful for understanding how machines read your text

… more to come …

By all means – feel free to make suggestions!

FAST Forward 2009: Notes from the third day

Bjørn Olstad: Microsoft’s vision for enterprise search

Search as a transparent and ubiquitous layer providing information and context seamlessly – from a search box (tell me what you want in 1.4 words and I will answer) to a conversational interface (giving pointers to more information and suggestions for continued searches) to a natural interface.

Demo of Microsoft Surface: Camera interface, can recognize things. Multiuser (as opposed to Apple). Showed an application built on search with touch – whenever you touch an information object a query goes towards an ESP implementation and brings up all the information available on that object.

Very impressive demo of Excel Gemini: How do you fit enterprise data into Excel? (Picture of a VW bug with a jet engine.) Pulls 100 million rows into Excel, sorts them (instantly), slices and dices. Built on top of ESP, does extreme compression, takes advantage of high memory, allows publishing of live spreadsheets to Sharepoint. Extremely impressive, worth the whole conference.

Bjørn continues talking about search as a platform: Demoing Globrix.com, where you can ask questions about apartments and houses and get a rich search experience where you can change attributes and the data changes dynamically. Globrix does not hold content themselves, but crawls available content on the web and shows it (much like Kayak.com for airline tickets).

Another demo: Search for entertainment based on location, friends and content. Moving from there to a focused movie site. This is federated search that understands some of the semantics (understands that “David Bowie” refers to a person and therefore only searches certain databases). Also incorporates community (letting users edit the results and feed them back).

FAST AdMomentum – advertising network – has had tremendous growth.

Content analytics: How can you lay a foundation for a good search experience by focusing on data quality? Demo: Content Integration Studio, sucking out semantics from unstructured text and writing it back both to the search engine and to databases (such as an HR database).

Panel session on enterprise search

Hitachi consulting (Ellen): Very big focus on the economy now, almost all conversations are about that topic. eDiscovery is important: Looking at many sources with a view towards risk discovery and risk mitigation.

EMC consulting (Mark Stone): Natural interfaces will be important, freeing up the mind to focus on the information rather than the interface. Shows a video of a small girl using the Surface table and how she very quickly starts to focus on the pictures she is manipulating rather than the interface – she completely forgets that she is working with a computer.

Sue Feldman, IDC: We have to get beyond the document paradigm. I want to see interfaces that will immerse me in the sea of information and explore it, without having to think about what application it is in.

Sue Feldman: Core issue with search: Data quality and making it a rich experience for the user. Anthropological, linguistic and cultural issues, getting people to understand both what they are seeing and what they are looking for.  We are just beginning on this journey. From keyword matching and relevance ranking to pulling the user in, having a dialogue with the information. What we are seeing is hybrid systems that combine collaboration, search, analysis etc.

AMR Research: There is a religious war going on, between collaborative systems, portals, content management systems, and search. They all claim to be the answer to the problem of connecting users with their data. There is also consolidation in the market, partially driven by the economy, but there is also a consolidation of functionality and an explosion in new ideas, many small companies coming up with new ideas.  No one technology is going to solve all of these problems. Lots of opportunity because Microsoft is gobbling up all these technologies, trying to provide one product that covers most (Sharepoint).

Q: Examples of interaction management?

Hitachi consulting: Best examples currently found in collaboration and community software.

EMC: There is a tool out there that searches not only blogs, but specifically the comment sections of blogs, looking for mentions of products. Do sentiment analysis, find out what the customers are saying about you.

Sue Feldman: Searching through corporate communications in lawsuit situations. Ad targeting. And what is the relationship between search and innovation?

Hitachi: Innovation comes from finding what you did not expect to find.

Q: This question always comes up: Search is a commodity – or is it? What is the current market doing for search adoption?

AMR: I am not sure who says that, there is so much room for innovation, so I can’t understand why anyone would say it is commoditized. Go out there and find the opportunities.

Sue F: Well, search is a tool, like a screwdriver. But I really need a screwdriver. The toolbox has expanded so much. I see the search market continuing to explode even though the technology is tanking. Possible that we will see a disruption with a new platform based on information management, access and collaboration.

EMC: We are seeing growth, the business will mature because companies have to focus on what the business really needs.

Sue Feldman & others: Search use awards

Customer awards:

  • Best productivity advancement: Verizon Business.
  • Best digital market application (I): McGraw-Hill Platts (doing industry-specific searches, 50% increase in trial subscriptions, 40% increase in revenue.)
  • Best digital market application (II): SPH Search (reader interaction and content integrated with newspaper sources, federated search.)
  • Social computing: Accenture (internal search on people profiles and content)
  • User engagement: Kakaku.com, Japan (700m pageviews, 18m unique users)
  • User engagement: AutoTrader (peak query level of 1500 qps)

Partner awards:

  • Digital market solution: Comperio (use of search for user interaction)
  • Social computing solution: NewsGator (enterprise social computing on top of Sharepoint)
  • User experience solutions: EMC Consulting
  • Partner of the year: Hitachi consulting.

FAST – technology futures and optimization

Notes from various presentations at FAST Forward

Bjørn Olstad and Svein Arne Gylterud: Technology briefing

Two main directions: FAST for Sharepoint and FAST for Internet Business. Various other licensing options. Richer search experience, taking into account time, user profile data, and tagging.

Some new features, available directly: In-picture thumbnail view of docs, can collect PowerPoint slides without starting PowerPoint.

FAST Search for Internet Business: Content Integration Studio (new version), scalability increased (better performance on fewer servers).

FAST AdMomentum: Competitor to Google Adwords and Adsense. AdMomentum is an integrated platform for managing ads, including display ads. Can track user behavior across devices and platforms.

Data increasingly residing in a hybrid infrastructure, need to move to a model with intent in, content out. Moving from text-centric approach to richer media. Will continue multiplatform, but some new components are based on .net and will therefore only run on Windows.

Main innovations: Configuration tool: Same tool for doing indexing of content as for evaluating queries, based on the graphical user interface from CIS. Also innovations in the search core: More context awareness, more scalability.

Mark Stone and Richard Griffin, EMC2: Beneath the Surface: Search Without the (Text) Box – An Insight into the Next-Generation of User Experiences

Going from a command line interface – not really different from Archie – to conversational interfaces (such as FAST) to natural user interfaces (example: Look at an apple through a screen and get information about it superimposed on the image.) Another example: An umbrella, made by Ambient Devices, which has a handle that glows blue when it is going to rain. All this is powered by search.

Key areas for NUI (Natural User Interfaces): Relevancy, simplicity, speed (reducing the time consequence of errors), and unification (get everything in one interface).

Demo: The Look Finder. Dynamic changing of pictures of clothes based on attributes, role models (Kate Moss), color etc.

Richard Griffin: How to design NUIs: Start with the user, finding out which attributes are important to them. Do a lot of sketching, Flash-based technology etc. to create an information design. Need three roles: Designer, developer and integrator.

Designers use Photoshop and Illustrator, Integrator uses Blend, developer uses Visual Studio.

Demo: Surface table with an application

Dan Benson and Paul Summers, Microsoft: Making FAST ESP Shine: Best Practices from the Solution Architects

Performance tuning: Indexing latency: Document processing: Entity extraction, clustering, lemmatization, doc conversion, etc. Doc processing is CPU intensive.

Index performance tuning: Use 4-5 partitions, keep partition 0 and 1 small, so that they can be indexed quickly and used for late insertion documents. Consider doing lemmatization in the query expansion rather than in document expansion.

Put small partitions on RAMdisk or SSD.

Query performance tuning: QPS is a critical issue. Queries enter the system through the QR server, where they are processed and passed down to the top-level dispatcher, which distributes the query to the low-level dispatchers, which sit on individual nodes. In high-QPS scenarios, you want fewer partitions, typically 3. Typically, you turn off spellchecking and query-side lemmatization as well as synonym expansion on the query side (which means you have to do that on the document processing side, which is always the tradeoff). Not much tuning is done on the dispatchers. Much tuning can be done on the low-level search engines, for each partition.
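To make the query path above concrete, here is a minimal sketch of the fan-out and merge pattern, assuming each partition returns scored hits. The names (make_partition, low_level_dispatch, top_level_dispatch) and the toy term-count scoring are hypothetical illustrations, not FAST ESP's actual API.

```python
import heapq
from typing import Callable

# Hypothetical types: a partition search takes a query and returns (score, doc_id) hits.
Hit = tuple[float, str]
PartitionSearch = Callable[[str], list[Hit]]


def make_partition(docs: dict[str, str]) -> PartitionSearch:
    """Toy partition: a document's score is how often the query term occurs in it."""
    def search(query: str) -> list[Hit]:
        term = query.lower()
        hits = []
        for doc_id, text in docs.items():
            count = text.lower().split().count(term)
            if count:
                hits.append((float(count), doc_id))
        return hits
    return search


def low_level_dispatch(query: str, partitions: list[PartitionSearch], k: int) -> list[Hit]:
    """Query every partition on one node and keep that node's k best hits."""
    hits = [hit for search in partitions for hit in search(query)]
    return heapq.nlargest(k, hits)


def top_level_dispatch(query: str, nodes: list[list[PartitionSearch]], k: int) -> list[Hit]:
    """Fan the query out to every node, then merge the per-node results."""
    per_node = [low_level_dispatch(query, partitions, k) for partitions in nodes]
    return heapq.nlargest(k, (hit for hits in per_node for hit in hits))
```

Every query touches every partition in this sketch, which is one way to see why fewer partitions help when QPS is high.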

Navigators and document summaries allow for a lot of performance gain. You can get a lot of performance from not sending back fields that you are not going to display, even though they are searchable. Navigators are costly both for memory and query performance because of the CPU computations necessary. Be careful with the number of navigators requested – only send those that the interfaces need. Wildcards are expensive because the engine needs to index all variants of the terms. Turn them off on long fields. Hit highlighting is also costly and can be turned off if you don’t need it. Reduce nesting, and use filters before you do dynamic ranking, so you have fewer results to rank. Higher up: Minimize the number of hits.
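Two of the tips above (filter before ranking, and return only the fields you display) can be sketched in a few lines. This is an illustration under assumed names (filtered_search, keep, score, display_fields), not code from the presentation or from FAST ESP.

```python
from typing import Callable

Doc = dict[str, object]


def filtered_search(
    docs: list[Doc],
    keep: Callable[[Doc], bool],
    score: Callable[[Doc], float],
    display_fields: list[str],
    max_hits: int,
) -> list[Doc]:
    """Apply the cheap filter first, rank only the survivors, trim the output fields."""
    # Filtering first means the (expensive) scoring function runs on fewer documents.
    candidates = [doc for doc in docs if keep(doc)]
    ranked = sorted(candidates, key=score, reverse=True)[:max_hits]
    # Return only the fields the interface will show, even if others are searchable.
    return [{f: doc[f] for f in display_fields if f in doc} for doc in ranked]
```

The ordering is the point: the scoring function is never called for documents the filter has already excluded.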

Document capacity: Try to save hardware costs when storing documents. If you want to optimize for this, you need to create more partitions. Keep in mind that a 32-bit system has a max of 4 GB per processor. Archive Indexing Feature allows for adding new nodes on the fly, because it sends data to columns that have capacity. This is useful when sizing the installation.
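The 4G ceiling matches the size of a 32-bit address space (2^32 bytes). The sketch below shows that arithmetic plus one possible reading of "sends data to columns that have capacity". The routing rule (pick the column with the most free room) and the name pick_column are assumptions for illustration; the notes do not say how the Archive Indexing Feature actually chooses.

```python
ADDRESSABLE_BYTES_32BIT = 2 ** 32  # 4 GiB: the most a 32-bit address can reach


def pick_column(used: list[int], capacity: int) -> int:
    """Return the index of the column with the most free room (assumed routing rule)."""
    free = [capacity - u for u in used]
    best = max(range(len(used)), key=lambda i: free[i])
    if free[best] <= 0:
        raise RuntimeError("all columns are full; add a node")
    return best
```

With a rule like this, a newly added empty column starts receiving documents immediately, which is why nodes can be added on the fly.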

FASTForward 2009 – impressions from the second day

The second day has less of the “big picture” and more of product announcements and more technical detail. Here are some notes as the day progresses:

Kirk Koenigsbauer, Microsoft: Our enterprise search vision & roadmap

Kirk is responsible for the business side of FAST after the acquisition. He is speaking on Microsoft’s commitment to search, the roadmap and future business directions, including pricing.

About 15% of the research done in MS Research is search-oriented.  10 years support on current FAST products, even non-MS platform.

Search server express now has more than 100,000 downloads. 1/3 of MS enterprise customers have deployed a MS search solution. Partner #s have doubled.

MS vision: Create experiences that combine the magic of software with the power of Internet services across a world of devices. Search is integral to vision.

Demo: Use of search in a business setting, showing documents in a viewer format, extracting keywords and concepts.

Announcing two new products:

  • FAST for Sharepoint, which is FAST ESP integrated into Sharepoint, available at a substantially lower price than FAST ESP, typically 50% lower price. Simpler pricing model: Per-user charge for FAST ESP standalone, included in Sharepoint. Still need to buy a server at 25K a pop, but this is substantially lower price. Will be available from next rollout of Office (wave 14). Will also provide a licensing bridge for those who purchase Sharepoint now.
  • FAST Search for Internet business. New functionality for interaction management (promotions, campaigns etc.), Content Integration Studio (graphical interface for managing content restructuring and content integration), and simplified licensing: Language pack and connectors will be part of the standard package.

Valentin Richter, Raytion: User engagement

Low satisfaction with many search solutions, and 70% of search managers do not study search logs with an eye to improving the experience. Went through a list of common myths about search (such as “people know what they are looking for”.) People want simplicity – they cannot handle expressions and need more of a drill-down approach, navigating through related information. Installing search platforms immediately leads to a focus on information quality: You find duplicates, you find confidential documents everywhere, and so on – be ready for it both in a technical and organizational sense.

Walton Smith, Booz Allen Hamilton: Case study of use of FAST and Sharepoint

BAH based in Virginia, traditionally centralized, but expanding. 300 partners, all wanting to go in different directions. De facto collaboration tool was Outlook. Created a social computing platform called hello.bah.com. Among the results: Have given access to more esoteric material, which caused issues with indexing. Were able to pull new people from other parts of the organization on a project. Other application: staffing.bah.com, finding people with the right credentials and experience, pulling information from many sources. search.bah.com crawls hello and iShare. About 1/3 of the firm is now using the platform, lots of information on individuals.

Charlene Li: Transformation  based on social technologies

It is all about engaging users in dialogue: H&R Block has a page on Facebook where they discuss tax issues – not trying to pull people in, at least not explicitly. Comcast is on Twitter with their customer service people. Starbucks testing ideas, such as automated purchasing based on a customer card. Beth Israel’s CEO blogs about what it is like to run a hospital. Necessary to change search to include social software: Technorati searches blogs, del.icio.us allows social bookmarking. You can use Twitter mapping to see what people are discussing – showing that what is rated high somewhere may not be what is most discussed. Amazon now lets you filter reviews by friends.

Conclusion: Social networks will be like air, and will transform companies from the outside in. Social media is impacting search at multiple levels, refining results based on personalization details derived from users’ social circles.

Jørn Ellefsen, Comperio: In search of profits

Comperio has more than 100 customers and has created a front application, Comperio Front, that sits between the customer’s web pages and their search engine. Introduced Drew Brunell who works with SEO for, among others, News International. Paid search is the growing part of the advertising market, everything else is either flat (display ads) or sinking (traditional ads). Doing a lot of experimentation linking into customer behavior – for instance, matching content with areas that see a lot of comments, “invisible newspapers”. Another notion is the “curated content model”, setting up pages with a blend of original content and stuff from the outside web. Topic pages based on “zero-term search”, offering editorial content put together automatically around a topic. Stefan Sveen, CTO Comperio, demonstrated topic pages from Times Online: Users and journalists can create their own topic pages, based on search results, and mark entries coming in after the page is created.

Venkat Krishnamoorthy, Thomson Reuters: Delivering Contextual and Intelligent Information to Premium Customers

Reuters delivers context-sensitive information for pre-investment analysis to premier customers. They have done this for a long time, but want to change from being a data-delivery company to being integrated into the user’s workflow. Challenges here included customers having to stitch together too many applications and difficulty finding information, especially across different kinds of assets – more than 40 content databases. Solution: Put in a search and navigation layer between their desktop products (they have two, a web-based one and a premium, client-based one).

Seeking PhD candidates for iAD project

(Note: This is not the official announcement, which you can find here, where you will also find a link to the application program. I post this here because this blog is easier to update, allows me to link to pertinent information more easily, allows pictures, and allows comments and questions.)

 

Announcement: Available Ph. D. Scholarships in Technology Strategy

 

BI Norwegian School of Management is inviting applications for scholarships in technology strategy. The scholarships are made available through the iAD Center for Research-based Innovation, an eight-year research project funded by the Norwegian Research Council and hosted by FAST Search and Transfer, a Microsoft Company. The candidates will pursue their Ph. D. through the doctoral programs of BI Norwegian School of Management and do their thesis research on topics of interest to the iAD project.

Clayton Christensen on health care disruption

Here is Clayton Christensen giving a talk on disruptions in health care (but really a good introduction on disruption in general) at MIT:

http://mitworld.mit.edu/flash/player/Main.swf?host=cp58255.edgefcs.net&flv=mitw-01023-esd-innovator-prescription-christensen-13may2008&preview=http://mitworld.mit.edu//uploads/mitwstill-01023-esd-innovator-prescription-christensen-13may2008.jpg

 

Note that Clay uses Øystein Fjeldstad’s Value Configurations framework a little before 1:00:00 – a result of many conversations aboard the "Disruptive Cruise" which I arranged last year…. don’t say we aren’t doing our part over here….

Ozzie and the cloud

Steven Levy, a tech writer whose every article I read if I can get my hands on it, has a fascinating Wired article about Ray Ozzie and his long march to make Microsoft survive and prosper in the cloud. Service-based computing can be a disruptive innovation for Microsoft, since customers become less reliant on a single, fat client (dominated by MS) and instead can use a  browser as their main interface.

I have used Lotus Notes since well before the company was bought by IBM, and always considered it to be a fantastic platform that is somewhat underused, chiefly because while its execution is great, the user interface is somewhat clumsy (getting better, but still) and it is hard to program for. As an infrastructure play for a large corporation, Notes is just great. As a platform for software innovation and innovative interaction, it leaves a lot to be desired. The question is – can Microsoft gain dominance in this market (Sharepoint seems to execute on that one), extend it to consumers (Vista is not a good omen here), and somehow find a business model that works? (By that I don’t mean one with the same profitability as it has now, that just isn’t possible. But one that is somewhat profitable long-term?)

If anyone is going to be able to pull that off, it will be Ozzie. The article paints, as I see it, a very complete picture and tells me a lot more about the relationship between Microsoft and Ozzie than I knew. But that is usual with Steven Levy articles, ever since he wrote "Hackers: Heroes of the Computer Revolution" back in 1984.

Highly recommended. (And since I like long and detailed articles: this one is at 6900 words or more than 40,000 characters including spaces. Just a hint to my Norwegian newspaper friends, who think anything more than 7000 chars won’t be read by anyone.)

Tim O’Reilly nails it on cloud computing

In this long and very interesting post, Tim O’Reilly divides cloud computing into three types: Utility, platform and end-user applications, and underscores that network effects rather than cost advantages will be what drives economics in this area. (This in contrast to the Economist’s piece this week, which places relatively little emphasis on this, instead talking about the simplification of corporate data centers – though the Economist piece is focused on corporate IT.)

Network effects happen when having new users on a platform or service is a benefit to the other users. This benefit can come from platform integration – for instance, if we both share the same service we can do things within that service that may not be possible between services, due to differences in implementation or lack of translating standards.

Another benefit comes when the shared service can leverage individual users’ activities. Google’s Gmail, for instance, has a wonderful spam filter, which is very reliable because it tracks millions of users’ selections on what is spam and what isn’t.

Tim focuses on the network effects of developers, which is an important reason why Microsoft, not Apple, won the microcomputer war. When Steve Ballmer jumped around shouting "developers, developers, developers", he was demonstrating a sound understanding of what made his business take off – and was willing to make a fool of himself to prove it.

Tim also invokes Clay Christensen’s "law of conservation of attractive profits", arguing that as software becomes commoditized, opportunities for profits will spring up in adjacent markets. In other words, someone (Jeff Bezos? Larry and Sergey?) needs to start jumping up and down, shouting "mashupers, mashupers, mashupers" or perhaps "interactors, interactors, interactors" and, more importantly, provide a business model for those that build value-added services on top of the widely shared platforms and easily available applications they provide.

One way to do that could be to make available some of the data generated by user activities, which today most of these companies keep to themselves. That will require balancing on a sharp edge between providing useful data, taking care of user privacy, and not giving away too much of your profitability. As my colleague Øystein Fjeldstad and I wrote in an article a few years ago – the companies playing in this field will have to make some hard decisions between growing the pie and securing the biggest piece for themselves.

If we cannot harness network effects, cloud computing becomes a cost play, and after a while about as interesting, in terms of technical evolution, as utilities are now. The USA is behind Europe and Asia in mobile phone systems partially because US cellphone companies were late in developing advanced interconnect and roaming agreements, instead trying to herd customers into their own networks. Let’s hope the cloud computing companies have learned something from this….