by Dr. David Hartzband

Overview of NLP

Natural Language Processing (NLP) has been proposed as a general solution for many data capture and usage issues with respect to EHRs and healthcare information in general. NLP techniques extract information from unstructured clinical data to make it available for analysis. As with any technology, it is not a magic bullet, but it does provide some unique and valuable capabilities to address current and near-future problems[1]. Auto-indexing is one of the many technical dimensions of NLP that are starting to be used in healthcare.

Natural language processing essentially recognizes and extracts data from free text or speech. This might include:

  • Interactive voice recognition – speech-to-text capability;
  • Optical character recognition – text-to-data capability;
  • Pattern and image recognition – the ability to identify and classify specific textual patterns (or image patterns).

Operationally, NLP can be applied to the classification and categorization of text or speech, sorting patterns into categories (user-defined or system-defined), and the recognition of specific text patterns (auto-indexing).

The various types of NLP systems include:

  • Rules-based – patterns are discerned based on pre-defined rules;
  • Statistical – patterns are discerned based on the statistical relationships found in a specific text;
  • Hybrid – patterns are discerned based on both a rule-base and statistical analysis;
  • Learned – patterns are discerned based on how a machine learning model has been trained.
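To make the contrast between the first two approaches concrete, here is a minimal, purely illustrative Python sketch (the rule, the note text, and the statistic are all invented): a rules-based pass applies a pre-defined regular expression, while a statistical pass flags terms that are over-represented in the specific text.

```python
import re
from collections import Counter

# --- Rules-based: a pre-defined rule (here, a regex) discerns the pattern ---
BP_RULE = re.compile(r"\b(\d{2,3})/(\d{2,3})\b")  # e.g. a blood pressure "120/80"

def rules_based(text):
    """Return every match of the pre-defined rule."""
    return BP_RULE.findall(text)

# --- Statistical: patterns are terms unusually frequent in this text ---
def statistical(text):
    """Return terms that occur more often than the average term
    frequency in the text (a deliberately toy statistic)."""
    words = [w.strip(".,;").lower() for w in text.split()]
    counts = Counter(words)
    mean = sum(counts.values()) / len(counts)
    return sorted(w for w, c in counts.items() if c > mean)

note = "Patient BP recorded as 120/80; BP stable. BP follow-up advised."
print(rules_based(note))   # [('120', '80')]
print(statistical(note))   # ['bp']
```

Real statistical NLP systems use far richer models than this, but the division of labor is the same: rules are supplied in advance, while statistical patterns are derived from the text itself.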

NLP systems generally consist of applications, or suites of applications, that provide the following capabilities:

  • Information extraction – for written text and/or speech;
  • Automatic speech recognition;
  • Machine translation;
  • Dialogue systems – the ability to carry on a conversation with a user.


What is auto-indexing, and where does it fit in the NLP context? Auto-indexing is a hybrid type of pattern recognition that uses specific, user-defined terms or vocabulary to identify items of interest. In healthcare information, auto-indexing is used primarily to extract and categorize clinical and demographic information from the text data in EHRs, practice management systems, registries, and all types of external reports (labs, hospital discharge summaries, clinical records from an HIE or other partners, etc.). It requires that terms of interest be mapped from text to structured fields. This mapping is generally defined by the user and becomes the rule-base for the indexing process. In some cases the indexing system already has an existing set of terms that it recognizes, but these terms still require mapping.
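The term-to-field mapping described above can be sketched in a few lines of Python. This is a hypothetical, minimal rule-base: a dictionary from user-defined terms of interest to structured fields (the field names and the code are invented), plus a scan that emits field/value pairs ready for placement in the record.

```python
# Hypothetical user-defined rule-base: term of interest -> structured field.
# All field names, terms, and codes here are invented for illustration;
# in practice the mapping is defined against the EHR's actual schema.
TERM_MAP = {
    "hypertension": ("problem_list", "I10"),
    "metformin":    ("medication_list", "metformin"),
    "a1c":          ("lab_results", "hemoglobin_a1c"),
}

def auto_index(text):
    """Scan free text for mapped terms and return structured entries."""
    lowered = text.lower()
    return [
        {"field": field, "value": value, "term": term}
        for term, (field, value) in TERM_MAP.items()
        if term in lowered
    ]

discharge_note = (
    "Patient with long-standing hypertension; continue metformin. "
    "Repeat A1c in three months."
)
for entry in auto_index(discharge_note):
    print(entry)
```

A production indexer would also have to handle negation ("no history of hypertension"), word boundaries, and synonyms, which is one reason the mapping work is nontrivial and is often carefully implemented with vendor or consultant help.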

Vendor Snapshot

Most EHRs have some kind of document management system that allows them to import and manage documents from outside of the medical record system. EHR vendors are beginning to see that document and text management capabilities will be necessary to optimize their offerings, so we should see more emphasis on this from more vendors over the next few years. In the interim, many EHR vendors point users to partners and consultants to provide this capability. A quick survey of several EHRs in use in CHCs today revealed the following:

  • GE Centricity – Auto-indexing is supplied by Kryptiq, a product available from Enli Health Intelligence using technology from InDxLogic. The application utilizes primarily open source components and uses the Apache web server as its presentation and connectivity layer, as well as Google Analytics with a customized instruction set to perform its text analysis. It is a relatively modern system (designed in 2005) that provides typical text indexing, categorization and mapping capabilities.
  • Epic – Epic offers custom text management through services and consulting engagements.
  • NextGen – NextGen offers a document management capability that includes import of text documents and images, with text extraction through an optical character recognition (OCR) system. Mapping is done separately through services or consulting engagements. NextGen also recommends such services offered through Dell and a variety of other partners.
  • eClinicalWorks – eCW offers the equivalent of indexing and other text management services through a partnership with Nuance.

Nuance, which specializes in speech and text, is more typical of robust modern language processing systems and provides capabilities beyond those typically found in auto-indexing systems, across all core NLP areas. The application now has a clinical language understanding (CLU) engine that is specialized so that it already recognizes and understands a wide variety of medical terms, including a broad clinical vocabulary and coding schemas such as ICD-9, ICD-10, SNOMED, LOINC, etc. Because terms are built into the Nuance CLU engine, Nuance does not need terms of interest identified in advance. Text or speech can be scanned and presented for placement in the EHR. Alternatively, mapping can be defined to provide a specific rule set. Nuance can be used to dictate notes that are automatically entered as data rather than as text. It can also be used to scan internal (to the EHR) text or external documents. Work still has to be done to ensure that text is properly placed as data, but this effort is greatly reduced by the system's understanding of a broad range of clinical terms.

There are a number of NLP systems like Nuance available. The current leading vendors in this area for healthcare include 3M, IBM, Cerner, Nuance, Microsoft, Health Fidelity, Apixio, Linguamatics and Dolbey Systems. There are even some open source NLP systems available, although using them for indexing and mapping of text into EHRs would also require a great deal of work. These include Stanford CoreNLP, Apache OpenNLP, GATE, Apache UIMA and Cloudera/Wired Informatics.

It is worth noting what IBM has been doing in this space, and in healthcare applications in general. We have all heard of IBM Watson, the machine learning system that beat former Jeopardy champions at their own game. Watson is designed as a hybrid machine learning system that uses many different technologies to achieve its results. It has very powerful NLP capabilities and can "read" and assimilate enormous amounts of text. It can then synthesize that text, create and test hypotheses based on its understanding, and produce explanations of what it has done. It is "trained" by assimilating material in a specific area. IBM Watson Healthcare, in conjunction with Memorial Sloan Kettering Cancer Center, has now trained the system for use in cancer diagnosis and treatment planning. Watson can now examine a person's electronic record (including text and images), determine a diagnosis for specific types of cancer, and create a treatment plan based on the body of knowledge that it has assimilated. In partnership with the International Diabetes Federation and Medtronic, Watson Healthcare is looking at Type 2 diabetes next. Not everyone has the resources that IBM has put into Watson, but IBM is expected to allow use of these systems by healthcare organizations in the near future. In addition, some large healthcare organizations, such as Kaiser Permanente and Geisinger, have developed their own point-of-care recommendation systems that include substantial NLP capabilities.

Applying Auto-Indexing in Practice

Auto-indexing is currently very useful, especially as CHCs deal with larger quantities of complex text data, where the capacity to evaluate and classify free text becomes increasingly important. EHR vendors are starting to address the issue of text extraction and assimilation in their systems and can advise, and in many cases provide services, to facilitate this capacity. It is important to note, however, that auto-indexing is neither "smart" nor predictive; it relies on user-defined mapping that must be carefully implemented to have value. Deeper and larger-scale natural language capabilities, including those based on machine learning like IBM Watson, will become more important and prevalent in healthcare over the next three to five years, and CHCs will need to keep up with this technology as inputs to EHRs and other HIT systems become more complex and contain more text as well as structured data.