Learning from Data at Microsoft

Quantifying improvements to user experience.


The Idea

At Bing, I worked in the Whole Page Relevance group to measure the impact of the full search results page on users. A critical but often neglected aspect of search quality is recall. It is common practice to measure the relevance of the results that were returned, but much harder to measure which relevant results were not returned. I ran experiments using human judges to generate related search terms and to determine how we could expand the reach of search page features to new queries.

Information Architecture

Wireframes

Answer Selection

Each new study begins with a request to generate new queries for an existing answer category. Because the answer type is selected first, the system can then request additional details specific to that type. Users are trained to understand the differences between answer types, and can find additional information on the Bing website.


Search Term Details

Once the user has selected the answer type, they provide more information for the study, including the root search query, how many results are requested, and any specific details for the request. With these settings, users can run a study of any size and make special requests, such as generating new terms that do not include the root query.
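The settings described above can be sketched as a small configuration record. This is a hypothetical illustration; the field names (`answer_type`, `root_query`, and so on) are my own, not the actual internal schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the settings a study creator might supply.
# Field names are illustrative, not the real Bing-internal schema.
@dataclass
class StudySettings:
    answer_type: str                  # existing answer category, e.g. "weather"
    root_query: str                   # the root search query to expand from
    num_results: int                  # how many related queries are requested
    exclude_root_terms: bool = False  # special request: omit the root query's words
    notes: str = ""                   # free-form instructions for the judges

settings = StudySettings(
    answer_type="weather",
    root_query="seattle weather",
    num_results=50,
    exclude_root_terms=True,
    notes="Prefer natural phrasings a user might actually type.",
)
```

Capturing the special requests as explicit flags, rather than free text alone, lets the system validate a study before routing it to judges.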


Enter Related Queries

Specially trained human judges generate new queries that they also expect to return the given search answer. These judges are shown the search query, the returned answer, and any special instructions from the study creator. The generated results are stored in a database for later analysis by the study organizer.


Analyze Recall Results

Once results have been generated by human judges, the study creator can analyze the recall success rate for each study. The recall success rate is determined by querying Bing with the human-generated queries and checking whether each query also returns the given answer as expected.
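The recall success rate described above reduces to a simple fraction: of the judge-generated queries, how many still trigger the expected answer. A minimal sketch, where `search_fn` is a stand-in for the internal Bing call (not a real API):

```python
def recall_success_rate(queries, answer_type, search_fn):
    """Fraction of judge-generated queries for which the search engine
    returns the expected answer type.

    `search_fn(query)` is a hypothetical stand-in for an internal call
    that returns the set of answer types shown on the results page.
    """
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if answer_type in search_fn(q))
    return hits / len(queries)

# Toy stand-in for the search call, keyed by query text.
fake_results = {
    "seattle forecast": {"weather", "web"},
    "rain tomorrow seattle": {"web"},
    "seattle 10 day weather": {"weather", "web"},
}
rate = recall_success_rate(list(fake_results), "weather", fake_results.get)
# Here two of the three queries still trigger the weather answer.
```

Queries that fail this check are exactly the recall gaps the study is designed to surface: plausible phrasings that should return the answer but do not.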


Manage Active Studies

Study organizers can monitor progress throughout the course of the study. If the results are not trending in a useful direction, the organizer can cancel the study so that resources are reassigned to more important tasks. Canceling prompts for confirmation before the study is stopped.


Manage Inactive Studies

Users can find previous studies by search query, date completed, or recall percentage. At the request of study organizers, a clone option was added to start a new study with the attributes of a previous one. Users would then typically change a single detail before launching the new study.
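The clone-and-tweak workflow above can be sketched as copying a completed study's attributes while resetting its run state. The field names and the dict representation are illustrative assumptions, not the actual implementation.

```python
def clone_study(previous, **overrides):
    """Start a new study from a previous one, overriding selected fields.

    `previous` is a dict of study attributes (field names are
    hypothetical); any keyword argument replaces the copied value.
    The new study always starts active, with no recall result yet.
    """
    new_study = dict(previous)  # shallow copy of the old attributes
    new_study.update(overrides, status="active", recall_rate=None)
    return new_study

old = {
    "root_query": "seattle weather",
    "num_results": 50,
    "status": "completed",
    "recall_rate": 0.82,
}
# Typical use: keep everything but request a larger study.
new = clone_study(old, num_results=200)
```

Resetting `status` and `recall_rate` in one place keeps a cloned study from accidentally inheriting the previous study's results.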

Mockups

Validation

Human-in-the-loop query generation led to improved recall for rich features returned by the Bing search engine.