Using Meaning Extraction To Improve Search Results
In my prior post, “Meaning Extraction for Business Strategy”, I described a new type of search result: “scenarios of meaning.” For example, in a search on “Cisco and VOIP”, we get back the scenario “Cisco is using a product marketing strategy of Professional Services”, a finding about how Cisco is using a particular marketing initiative in that market. This new type of search result is unique and exciting. But the story doesn’t stop there. Meaning extraction can be used to bump up the value of plain old search results as well. Let me explain.
Plain Old Search Results
Let us consider the ubiquitous search result. Sometime around, oh, 1994, the search result made its debut with the Web search engines Lycos, InfoSeek, Excite, and AltaVista. A user would input words; the search engine would compare them to an index of the words in Web pages (or “documents” in industry parlance), generate a list of documents that matched the user’s words, relevance-rank the list with a secret formula, and then present it to the user to browse, along with a little summary of each document. The list of relevance-ranked, summarized documents – when translated into an HTML page and delivered to the user’s browser – constituted search results as the search engine industry conceived them in 1994, or as I like to call them, plain old search results.
Sound familiar? Of course it does, because not much has changed with search results since 1994. The indexes have gotten bigger, the secret formulas for relevance ranking are now better, and the techniques used to generate the summaries have improved. But the 1994 format for search results is with us today. Google has supplanted the industry’s pioneers for Web search, Microsoft and Autonomy have established the enterprise search market, and Northern Light has defined search for research portals, but we later innovators all adopted the established, traditional structure for search results. Name another software or web application user interface that is unchanged since 1994! Come on search engine industry, we can do better than this. It is time to dump plain old search results.
What is the pressing need to do better? Well, when a user is performing research, as opposed to just looking up a fact like a phone number, the present structure of search results doesn’t help the user enough. The little summaries of each document are often not sufficient to judge how helpful the document will be. This is especially true for substantive documents like journal articles, market research reports, and patents.
For example, in Northern Light’s database of information technology analyst firm research reports, the median report length is five pages, many run 25 pages, and some run 50 pages. Typical four-line summaries reveal very little about such documents. Traditional search results force the user to guess at what might be relevant, download a few documents, and scan them quickly for indicators of whether they will serve the business purpose of the search. The user persists at this until he or she finds something that might be useful or becomes frustrated by the process and abandons it.
Our portal logs suggest that most business researchers on a serious project will actually download between two and four business research documents from the typical 10,000+ hits in a plain old search result and make do with those. So as users, we are not all that persistent in this interrupt-driven business world.
One of the really great things about search applications enhanced with meaning extraction is that we learn a very great amount about each document while we are indexing it. The industry term for what a text analytics application identifies in a document is entities, and the process is called entity extraction. Almost always, what the industry means by the word entity is really proper noun. The big three are company names, people names, and locations. You see those discussed over and over at industry conferences and in sales literature. Interface an extraction of those three items to a sentiment scoring engine and (presto!) you have a nice generalizable horizontal reputation management application that you can sell to companies in any industry. At Northern Light, we just don’t think this approach makes enough difference.
Northern Light also looks for company names. But after that, our entity extraction jumps to a different level. We extract references to conditions, circumstances, events, technologies, strategies, trends, and outcomes that have significance to the business purpose of the search. Northern Light has coined a special term for these types of entities: meaning-loaded entities. Examples of meaning-loaded entities include price cut, market share gain, credit crisis, acquisition strategy, brand and customer loyalty, energy price increase, and government bailout. In the life sciences arena, meaning-loaded entities include concepts like diseases, drugs, proteins, and clinical trials. Currently, Northern Light’s meaning extraction application, MI Analyst, has around 20,000 meaning-loaded entities in its taxonomy.
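To make the idea concrete, here is a minimal sketch of dictionary-based meaning-loaded entity extraction. The taxonomy slice, the function name, and the sample document below are all invented for illustration; the real MI Analyst taxonomy runs to roughly 20,000 entries, and its matching is surely more sophisticated than a simple phrase lookup.

```python
import re
from collections import Counter

# A tiny illustrative slice of a meaning-loaded entity taxonomy,
# grouped by entity type. These few entries are stand-ins for a
# production taxonomy of ~20,000 entities.
TAXONOMY = {
    "Business Issues": ["price cut", "market share gain", "credit crisis",
                        "acquisition strategy", "government bailout"],
    "Life Sciences":   ["clinical trial", "lung neoplasms"],
}

def extract_meaning_loaded_entities(text):
    """Count taxonomy phrase matches in a document, grouped by entity type."""
    lowered = text.lower()
    found = {}
    for entity_type, phrases in TAXONOMY.items():
        counts = Counter()
        for phrase in phrases:
            # Whole-phrase, case-insensitive match on word boundaries.
            counts[phrase] = len(re.findall(r"\b" + re.escape(phrase) + r"\b",
                                            lowered))
        hits = {p: n for p, n in counts.items() if n > 0}
        if hits:
            found[entity_type] = hits
    return found

doc = ("Analysts expect a price cut to drive a market share gain, "
       "though an acquisition strategy remains possible.")
print(extract_meaning_loaded_entities(doc))
```

The point of the sketch is only that, once the taxonomy exists, tagging each document with the meaning-loaded entities it mentions is cheap to do at indexing time.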
Using MI Analyst to index a document, we learn all kinds of interesting things about what is in each document. For example, not only can we find out what company names are mentioned, but also what technology trends are discussed and how business issues are affecting the players in a market. Or, in a life sciences setting, we can learn what diseases, drugs, and therapeutic strategies are discussed in a journal article.
We also know where each of these items is within the document, which gives us an opportunity to assess how related they are to one another, and to find scenarios of meaning when meaning-loaded entities are close to one another. (Northern Light calls these scenarios of meaning either Business Scenarios or Life Sciences Scenarios depending on the context). For example, if we find Cisco near Strategic Partnerships near Internet Telephony Market we may have just learned something significant. Or if we find the drug Taxol near Phase I Clinical Trial near Lung Neoplasms (“cancer” for you non-life sciences types), this might suggest a new avenue of research.
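A rough sketch of that proximity idea: given the token offsets of each entity’s mentions, treat two entities as a candidate scenario when any pair of their mentions falls within some window. The window size, function name, and positions below are assumptions for illustration, not Northern Light’s actual algorithm.

```python
# A minimal sketch of proximity-based scenario detection, assuming
# entities have already been located by token offset during indexing.

def find_scenarios(entity_positions, window=20):
    """Pair up entities with mentions within `window` tokens of each other.

    entity_positions maps entity name -> list of token offsets.
    Returns a set of (entity_a, entity_b) pairs judged related.
    """
    scenarios = set()
    names = sorted(entity_positions)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if any(abs(pa - pb) <= window
                   for pa in entity_positions[a]
                   for pb in entity_positions[b]):
                scenarios.add((a, b))
    return scenarios

# Hypothetical offsets echoing the Cisco example in the text.
positions = {
    "Cisco": [12, 310],
    "Strategic Partnerships": [18],
    "Internet Telephony Market": [25, 400],
}
print(find_scenarios(positions))
```

In this toy data all three entities cluster near the start of the document, so all three pairs surface as candidate scenarios; a real system would also rank and label them.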
Once we know all these things about a document, why not use that intelligence about the document to help our poor user we left three paragraphs up above in this post still struggling with plain old search results?
Changing the Structure of Search Results
Having recorded the presence of meaning-loaded entities, Business Scenarios, and Life Sciences Scenarios in each document, it is time to rethink the 1994-vintage search result format that started us down this path. One of the really nice things about research applications, as opposed to general Web search engines, is that we know the business purpose of the search. It might be a search engine of market research reports that helps product managers at an information technology company determine new features, or a competitive intelligence monitoring tool, or a pharmaceutical research application. Given this, we can select the types of meaning-loaded entities that will be most helpful to the users.
For example, in a market research setting, Companies, Venture-Funded Companies, Technologies, and Business Issues might be most helpful. Then, as each search is performed, the search result for each document can be enhanced to show the selected entities, which raises the value of the search results, as in the example below based on a search for “VOIP” on an information technology analyst database.
Plain Old Search Results
1. Measuring and Diagnosing VoIP Voice Quality
97%, Licensed Content, 03/22/2009
Voice over Internet Protocol (VoIP) places strict requirements on the network infrastructure. If there are any problems in network configuration or operation, voice quality suffers and users complain. This report, updated by Senior Analyst Tom Smith to incorporate the latest changes in measurement standards, technologies, and vendor products, examines the tools and techniques for both pre-deployment testing and post-deployment problem detection and diagnosis of voice quality. Report number 98736
2. Cable Voice Brings VoIP Into the Mainstream
91%, Licensed Content, 10/06/2008
While pure-play and telco voice over IP (VoIP) providers continue to struggle to win subscribers to their over-the-top (OTT) services, cablecos have been wildly successful in bringing VoIP technology to 12% of US consumers. What’s more, pure-play and telco VoIP users remain the same niche early adopters who’ve been using VoIP for the past three years, while cable voice subscribers are more mainstream. – This is report number 149147
Meaning-Loaded Search Results
1. Measuring and Diagnosing VoIP Voice Quality
Licensed Content, 03/22/2009
Voice over Internet Protocol (VoIP) places strict requirements on the network infrastructure. If there are any problems in network configuration or operation, voice quality suffers and users complain. This report, updated by Senior Analyst Tom Smith to incorporate the latest changes in measurement standards, technologies, and vendor products, examines the tools and techniques for both pre-deployment testing and post-deployment problem detection and diagnosis of voice quality.
Business Issues mentioned: Legacy Systems (2), Competitors (1), High Growth Product Market (1) more
Companies mentioned: Cisco Systems Inc (38), Nortel (21), NetIQ (9) more
Technologies mentioned: Voice Over IP (VoIP) (195), Private Branch Exchange (PBX) (45), Simple Network Management Protocol (SNMP) (13) more
2. Cable Voice Brings VoIP Into the Mainstream
Licensed Content, 10/06/2008
While pure-play and telco voice over IP (VoIP) providers continue to struggle to win subscribers to their over-the-top (OTT) services, cablecos have been wildly successful in bringing VoIP technology to 12% of US consumers. What’s more, pure-play and telco VoIP users remain the same niche early adopters who’ve been using VoIP for the past three years, while cable voice subscribers are more mainstream.
Business Issues mentioned: Product Strategy and Roadmap (8), Benchmarks (4), Customer Demand (1) more
Companies mentioned: Time Warner Inc (4), Comcast Inc (3), Verizon Communications (2) more
Venture Funded Companies mentioned: PurePlay (1) more
Technologies mentioned: Quality of Service (QoS) (1), Bundled Services (1), Mobile Phones (1) more
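For the curious, the entity lines in the enhanced results above could be produced by something as simple as sorting each document’s entity counts by frequency and showing the top few. The function below is a hypothetical sketch; the counts are copied from the first example result, plus one invented extra entry (Avaya) to trigger the “more” suffix.

```python
# Sketch of rendering a meaning-loaded entity line for one document,
# given the per-document entity counts recorded at indexing time.

def format_entity_line(entity_type, counts, top=3):
    """Render 'Type mentioned: A (n), B (m), C (k) more', by frequency."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    shown = ", ".join(f"{name} ({n})" for name, n in ranked[:top])
    suffix = " more" if len(ranked) > top else ""
    return f"{entity_type} mentioned: {shown}{suffix}"

companies = {"Cisco Systems Inc": 38, "Nortel": 21, "NetIQ": 9, "Avaya": 2}
print(format_entity_line("Companies", companies))
# Companies mentioned: Cisco Systems Inc (38), Nortel (21), NetIQ (9) more
```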
As you can see from the above example, the second set of search results is way more helpful. Take the first document: with meaning-loaded search results, we learn important new facts about the report that the plain old search results did not reveal. We learn that Legacy Systems is an important issue and that PBX is a technology mentioned often in the report. Bing: the light bulb goes on over our heads. It suddenly dawns on us that, in the enterprise VOIP market, how to deal with the omnipresent legacy PBX system, with its epicenter right at the front reception desk, could be a major issue.
Also, right from the search results I learn that Cisco and Nortel are companies to pay attention to in the enterprise VOIP market. Bing bing.
Glancing at the second hit’s meaning-loaded result, I learn that this report most likely lays out the Product Strategy and Roadmap for cable providers Time Warner and Comcast. Bing, bing, bing: my interest in this report just jumped off the scale if I am a product manager at an IT company that makes VOIP network gear.
Note that with plain old search results, none of this intelligence from either document comes through at all. We have the little summaries that provide a whisper of what the documents are about, but the additional information that both taught me things and got me interested in reading more is absent. Now, finally, we can dump the search result structure first presented to search engine users in 1994 and make search results way more useful. Northern Light believes that meaning-loaded search results are a game-changer for users.
Lastly, my dear reader, you may be wondering what the heck the “more” links in the meaning-loaded results above connect to. The answer is that they link to meaning-loaded document summaries. But to keep this blog post from growing to War and Peace proportions (if it hasn’t already), I will save that discussion for the next post.