FAST – technology futures and optimization

Notes from various presentations at FAST Forward

Bjørn Olstad and Svein Arne Gylterud: Technology briefing

Two main directions: FAST for SharePoint and FAST for Internet Business. Various other licensing options. Richer search experience, taking into account time, user profile data, and tagging.

Some new features, available directly: In-picture thumbnail view of documents; can collect PowerPoint slides without starting PowerPoint.

FAST Search for Internet Business: New version of Content Integration Studio (CIS), increased scalability (better performance on fewer servers).

FAST AdMomentum: Competitor to Google Adwords and Adsense. AdMomentum is an integrated platform for managing ads, including display ads. Can track user behavior across devices and platforms.

Data increasingly resides in a hybrid infrastructure, so there is a need to move to a model with intent in, content out. Moving from a text-centric approach to richer media. Will continue to be multiplatform, but some new components are based on .NET and will therefore only run on Windows.

Main innovations: Configuration tool: Same tool for doing indexing of content as for evaluating queries, based on the graphical user interface from CIS. Also innovations in the search core: More context awareness, more scalability.

Mark Stone and Richard Griffin, EMC2: Beneath the Surface: Search Without the (Text) Box – An Insight into the Next-Generation of User Experiences

Going from a command line interface – not really different from Archie – to conversational interfaces (such as FAST) to natural user interfaces (example: Look at an apple through a screen and get information about it superimposed on the image.) Another example: An umbrella, made by Ambient Devices, which has a handle that glows blue when it is going to rain. All this is powered by search.

Key areas for NUI (Natural User Interfaces): Relevancy, simplicity, speed (reducing the time consequence of errors), and unification (get everything in one interface).

Demo: The Look Finder. Dynamic changing of pictures of clothes based on attributes, role models (Kate Moss), color etc.

Richard Griffin: How to design NUIs: Start with the user, finding out which attributes are important to them. Do a lot of sketching, Flash-based technology etc. to create an information design. Need three roles: Designer, developer and integrator.

Designers use Photoshop and Illustrator, the integrator uses Blend, and the developer uses Visual Studio.

Demo: Surface table with an application

Dan Benson and Paul Summers, Microsoft: Making FAST ESP Shine: Best Practices from the Solution Architects

Performance tuning: Indexing latency: Document processing: Entity extraction, clustering, lemmatization, doc conversion, etc. Doc processing is CPU intensive.

Index performance tuning: Use 4-5 partitions, keep partition 0 and 1 small, so that they can be indexed quickly and used for late insertion documents. Consider doing lemmatization in the query expansion rather than in document expansion.
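To make the lemmatization tradeoff concrete, here is a minimal sketch in Python. It is not FAST ESP code: the toy lemma table, the "lemma:" key prefix and all function names are assumptions made up for illustration. Document expansion stores extra lemma postings at indexing time and keeps the query simple; query expansion keeps the index small and instead looks up every known word form at query time.

```python
from collections import defaultdict

# Toy lemma table (hypothetical); a real lemmatizer uses dictionaries or models.
LEMMA_OF = {"run": "run", "runs": "run", "ran": "run", "running": "run"}


def build_index(docs, expand_documents):
    """Inverted index; optionally add a lemma posting per token at index time."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
            if expand_documents:
                # Extra work and extra postings while indexing.
                index["lemma:" + LEMMA_OF.get(token, token)].add(doc_id)
    return index


def search_document_expansion(index, term):
    """Query stays cheap: one lookup on the lemma key."""
    return set(index.get("lemma:" + LEMMA_OF.get(term, term), set()))


def search_query_expansion(index, term):
    """Index stays small: the query is expanded to all known word forms."""
    lemma = LEMMA_OF.get(term, term)
    forms = {form for form, l in LEMMA_OF.items() if l == lemma} | {term}
    hits = set()
    for form in forms:
        hits |= index.get(form, set())
    return hits


docs = {1: "she ran home", 2: "he runs daily", 3: "walking is fine"}
doc_side_index = build_index(docs, expand_documents=True)
query_side_index = build_index(docs, expand_documents=False)
```

Both variants find the same documents for "running"; the difference is where the cost lands: a larger index and slower indexing, or more lookups per query.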

Put small partitions on RAMdisk or SSD.

Query performance tuning: QPS (queries per second) is a critical issue. Queries enter the system through the QR server, where they are processed and passed down to the top-level dispatcher, which distributes the query to the low-level dispatchers, which sit on individual nodes. In high-QPS scenarios, you want fewer partitions, typically 3. Typically, you turn off spellchecking, query-side lemmatization and query-side synonym expansion (which means you have to do that on the document processing side, which is always the tradeoff). Not much tuning is done on the dispatchers. Much tuning can be done on the low-level search engines, for each partition.
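The dispatch flow described above is a scatter-gather pattern, which can be sketched roughly as follows. This is an illustration only, not the FAST ESP dispatcher: it flattens the top-level and low-level dispatchers into one function, scores by plain term frequency, and all names are hypothetical. It does show why fewer partitions help at high QPS: every query pays for one search per partition plus the merge.

```python
import heapq


def search_partition(partition, term, k):
    """Stand-in for a low-level search engine: top-k hits from one partition."""
    hits = []
    for doc_id, text in partition:
        score = text.lower().split().count(term.lower())
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def dispatch(partitions, term, k):
    """Stand-in for the dispatcher: fan the query out, then merge the top-k."""
    per_partition = [search_partition(p, term, k) for p in partitions]
    merged = (hit for hits in per_partition for hit in hits)
    return heapq.nlargest(k, merged)


partitions = [
    [("a", "fast search fast"), ("b", "slow")],
    [("c", "fast"), ("d", "fast fast fast")],
]
top_hits = dispatch(partitions, "fast", k=2)
```

Each partition only needs to return its own top k hits, since no document outside a partition's top k can end up in the global top k.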

Navigators and document summaries allow for a lot of performance gain. You can get a lot of performance from not sending back fields that you are not going to display, even though they are searchable. Navigators are costly both for memory and query performance because of the CPU computations necessary. Be careful with the number of navigators requested: only send those that the interfaces need. Wildcards are expensive because the system needs to index all varieties of terms. Turn them off on long fields. Hit highlighting is also costly and can be turned off if you don’t need it. Reduce nesting, and use filters before you do dynamic ranking, so you have fewer results to rank. Higher up: Minimize the number of hits.
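Three of these tips (filter before ranking, cap the number of hits, return only the fields you display) can be sketched in a few lines of Python. The document layout, field names and scoring below are assumptions for illustration and have nothing to do with the actual FAST ESP query API.

```python
def search(docs, category, query_terms, display_fields, max_hits):
    # 1. Cheap filter first, so the ranking step sees fewer candidates.
    candidates = [d for d in docs if d["category"] == category]

    # 2. Rank only the filtered candidates (toy score: term frequency).
    scored = []
    for doc in candidates:
        words = doc["body"].lower().split()
        score = sum(words.count(term) for term in query_terms)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # 3. Cap the hit count and send back only the fields that get displayed.
    return [
        {field: doc[field] for field in display_fields}
        for _, doc in scored[:max_hits]
    ]


docs = [
    {"id": 1, "title": "Tuning", "category": "tech", "body": "index tuning index"},
    {"id": 2, "title": "Cooking", "category": "food", "body": "index of recipes"},
    {"id": 3, "title": "Scaling", "category": "tech", "body": "index scaling"},
]
results = search(docs, "tech", ["index"], ["id", "title"], max_hits=10)
```

The body field is searchable here but never leaves the function, which is the point of the second sentence in the paragraph above.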

Document capacity: Try to save hardware costs when storing documents. If you want to optimize for this, you need to create more partitions. Keep in mind that a 32-bit system has a max of 4 GB of addressable memory per process. The Archive Indexing Feature allows for adding new nodes on the fly, because it sends data to columns that have capacity. This is useful when sizing the installation.
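A rough sketch of the "send data to columns that have capacity" idea, in Python. How the Archive Indexing Feature actually picks a column was not covered in the talk; the most-free-space rule, the Column class and the capacities below are assumptions for illustration.

```python
class Column:
    """Hypothetical index column with a fixed document capacity."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.docs = []

    def free(self):
        return self.capacity - len(self.docs)


def route(columns, doc_id):
    """Send the document to the column with the most free capacity."""
    target = max(columns, key=lambda column: column.free())
    if target.free() <= 0:
        raise RuntimeError("all columns are full; add a node")
    target.docs.append(doc_id)
    return target.name


columns = [Column("col0", 2), Column("col1", 2)]
columns[0].docs = ["d1", "d2"]            # col0 is already full
first = route(columns, "d3")              # goes to col1
second = route(columns, "d4")             # col1 again, now full as well
columns.append(Column("col2", 2))         # new node added on the fly
third = route(columns, "d5")              # the new column takes over
```

Because routing only looks at free capacity, a newly added column starts receiving documents immediately, without redistributing what is already indexed.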