Why meaning extraction?
Every year, for enterprise clients, Northern Light provides aggregation and search for 750,000 market research reports with a value of $1 billion (if you bought each report at its list price) from 80 of the leading analyst firms. On a late night two summers ago I was sitting alone, brooding about the future of information technology, as I often do. The thought occurred to me that if I could read every report we aggregate each year, then I would be a lot smarter about this question, or at least much better informed. Well, I mused, I cannot read 750,000 reports, but the computer can.
In that moment, the future just reached back and hit me. What if search engines could read all the market intelligence documents a researcher has access to, identify the business issues reported on, suggest the trends, flag the threats, highlight the opportunities, and distinguish those documents that are the most important, not from a search relevance perspective, but from a meaning perspective?
For example, what if you could conduct a search on one of your product lines and have the search engine zero in on the reports that describe threats to your company’s market share or pricing strategy? What if you could feed the search engine a company name and have it provide you not with a list of documents, but with a report that highlights the company’s corporate strategy, business position, and opportunities and challenges?
This thought led us to launch the development project that culminated in MI Analyst, which is, to my knowledge, the first search engine providing automated analysis and discovery of business meaning from large stores of market intelligence content.
In the past, there have been many attempts to improve the intelligence of search engines, and these attempts have generally failed. For example:
- Increase the size of the database. All users and search engine journalists believe that bigger is better in search engine databases. If one is looking for a fact, the bigger the database, the better the chance of finding the fact. For example, if I want the phone number of my local pharmacy, I hope the search engine searches billions of documents and that the results summary of the first hit shows the number I need, ideally in bold type. But unlimited raw document count actually hurts the goal of the search if one is looking for analysis, commentary, and perspective on, say, a business issue. The more documents searched that contain uninformed opinion, the lower the density of quality search results, which means the user never sees some, or even much, of the best material from the most informed commentators. Database size has failed to make search engines more intelligent, and in many cases makes them dumber.
- Expand the search semantically. The argument put forward by the proponents of semantic search is that related terms can improve the recall of relevant documents. So, for example, if an analysis of the corpus of material reveals that “federal budget” is often found with “federal spending,” then a search on “federal budget” that is expanded by the search engine to include a search for “federal spending” will find documents that have just the second term but not the first. Supporters of semantic search argue that it helps users by giving them this extra increment of documents. The problem here is that the search engine is not doing anything more than expanding the list of search results, and most searches result in far more hits than any user has a chance of considering. If the user is only going to look at 10-30 hits anyway, what exactly is the value of expanding the total list of hits from 500,000 to 600,000? Zero, I would argue. Semantic search does not make search engines smarter, just more daunting to work with.
- Parse the query with natural language processing. The idea here is that if people could ask questions in plain English that the search engine could parse, more on-target results could be returned. After all, if you ask a human researcher a question in English, you get back a cogent reply, so why not apply this idea to search engines? But I would submit that the designers of natural language parsing solutions are not examining the reason why natural language questioning of a human researcher actually works. It is not that the researcher simply understands the question; it is that the researcher takes the question and then examines relevant material, identifies the trends, summarizes the issues, and constructs a conceptual framework for relating bits of information back to the original question. These are the real things that make the process work when a human researcher is asked a question in natural language. What search engine natural language front-ends do is take the user’s question, parse it, intelligently structure the question into a well-formed Boolean query, and then turn it over to the usual search process to spawn a long list of hits. All of the intelligent processing is done on the question, not on the answer. The user gets back a traditional search engine results page, and is abandoned to the usual process of manually examining the documents in order to figure out what they mean. Natural language processing does not make search engines more intelligent; it just makes it easier to communicate with the dumb back-end.
- Reorganize the UI or trick up the results list display. We have all witnessed the stream of UI tweaks from one company or another, like showing hits from images and videos as well as text pages. Or how about graphical tools for displaying search results, for example, as blobs with connecting lines showing related websites or subjects? The fact is that these efforts are superficial; the search engines in question are not interpreting the documents meaningfully, they are just finding jazzy ways of showing you the same old dumb results list of documents for you to examine.
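To make the semantic expansion argument in the list above concrete, here is a minimal Python sketch of co-occurrence-based query expansion. It is an illustration only, not how any particular search engine works: the tiny corpus, the phrase list, the `min_docs` threshold, and the function names are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs, phrases):
    """Count, for each pair of phrases, how many documents contain both."""
    counts = Counter()
    for text in docs:
        lowered = text.lower()
        present = sorted(p for p in phrases if p in lowered)
        counts.update(combinations(present, 2))
    return counts


def expand_query(query, counts, min_docs=2):
    """Add every phrase that co-occurs with the query in at least min_docs documents."""
    terms = {query}
    for (a, b), n in counts.items():
        if n >= min_docs:
            if a == query:
                terms.add(b)
            elif b == query:
                terms.add(a)
    return terms


def search(docs, terms):
    """Return the indexes of documents containing any of the terms."""
    return [i for i, text in enumerate(docs)
            if any(t in text.lower() for t in terms)]


# Hypothetical four-document corpus.
docs = [
    "The federal budget debate centers on federal spending caps.",
    "Analysts expect federal spending to shrink as the federal budget tightens.",
    "Federal spending on health care rose last year.",
    "State tax revenue fell.",
]
phrases = ["federal budget", "federal spending", "tax revenue"]

counts = cooccurrence_counts(docs, phrases)
expanded = expand_query("federal budget", counts)
plain_hits = search(docs, {"federal budget"})
expanded_hits = search(docs, expanded)
print(plain_hits, expanded_hits)
```

In this toy corpus the expanded query picks up one extra document that mentions only “federal spending.” The result is still just a longer list of documents; nothing in the sketch interprets what any of them say, which is the limitation described above.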
Pretty much all of these efforts have failed to transform the search experience when one is trying to understand a complicated subject in depth. One of my favorite marketing slogans is the ever-popular “stop searching and start finding,” which has been used by dozens of companies over the years (marketing people sometimes appear to have no “industry memory,” so they keep inventing the same marketing campaign over and over). Despite all this flash and smoke, search today is pretty much like it was in 1994. You enter a search term, get a list of documents, and then have at it, personally examining each document one at a time, trying to sort it all out. While there have been improvements in relevance ranking, and in the ability of search engines to find facts due to database size, the ability of search to assist in meaningful analysis is at its 1994 level, or at least was until recently, when applications involving text analytics like MI Analyst started to appear.
I know of two applications other than MI Analyst that turn search into an analytical process. One is a pharmaceutical solution that identifies candidate pathways for drug research by looking for terms that are near each other. The other is available on the Web: Google’s research project on flu trends at http://www.google.org/flutrends. Flu Trends looks for searches on terms like “flu,” “cough,” “sneeze,” etc. and, using the frequency of the search terms and the IP addresses of the searchers, constructs trend and geographic information on the progress of the annual flu epidemic. Google has announced that they worked with experts at the CDC in selecting the terms to analyze. When you use Flu Trends, you do not get a list of documents to read; you get an analysis of user behavior on the search engine, with charts, graphs, and maps, that is a proxy for the underlying flu season.
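As a rough illustration of the kind of aggregation described above, the following Python sketch computes the share of flu-related searches per region and week from a query log. It is not Google’s method: the log format, the term list, the region labels (assumed to have already been derived from IP addresses), and the function name are all hypothetical.

```python
from collections import defaultdict

# Illustrative term list only; the source says the real terms were chosen with CDC experts.
FLU_TERMS = {"flu", "cough", "sneeze"}


def flu_share_by_region_week(log):
    """Return {(region, week): fraction of queries mentioning a flu term}."""
    totals = defaultdict(int)
    flu = defaultdict(int)
    for week, region, query in log:
        key = (region, week)
        totals[key] += 1
        # A query counts once if any of its words is a flu term.
        if FLU_TERMS & set(query.lower().split()):
            flu[key] += 1
    return {key: flu[key] / totals[key] for key in totals}


# Hypothetical log entries: (week number, region, query text).
log = [
    (1, "Northeast", "flu symptoms"),
    (1, "Northeast", "pizza near me"),
    (2, "Northeast", "cough remedy"),
    (2, "Northeast", "flu shot"),
    (2, "Northeast", "weather"),
    (2, "Northeast", "sneeze cause"),
    (1, "South", "football scores"),
    (1, "South", "news"),
]

shares = flu_share_by_region_week(log)
for (region, week), share in sorted(shares.items()):
    print(region, week, share)
```

The output here is a trend signal by place and time rather than a list of documents to read, which is what makes this style of application analytical.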
There is one component common to the text analytic solutions that work: they are built (or assisted) by people who understand the research purpose of the search and who can use the search process to facilitate that research purpose by providing frameworks, analytic routines, and algorithms for interpreting textual information. This is not the job of horizontal, one-technology-fits-all search companies. What does one care about when researching a competitor? How does one tell that a new treatment strategy might work? Where are the problems, and how severe are they? These are questions of analysis, not of search, or at least not of search alone.
Search engines must evolve to have in-depth understanding of the searched material. Beyond search, categorization, faceted navigation, and entity extraction, which we all understand by this point, the future of search is meaning extraction.