Most applications today still process data mainly from structured and semistructured sources. They query SQL databases or present information from JSON or XML data sources. Many applications still avoid the complication of parsing and extracting knowledge from unstructured sources such as open text fields, rich text editors, database CLOB (character large object) data types, social media news streams, and full documents from tools like Microsoft Word, Google Docs, and Adobe Acrobat.
But the world of information is largely unstructured. People enter, search, and manage information in a myriad of tools and formats. Modern applications are going beyond just storing and retrieving unstructured information and are incorporating elements of natural language processing (NLP) to improve user experiences, manage complex information, enable chatbot dialogs, and perform text analytics.
What is natural language processing (NLP)?
NLP engines are designed to extract data, information, knowledge, and sentiment from blocks of text and documents. They often use a mix of parsing technologies, knowledge data structures, and machine-learning algorithms to extract and present information in comprehensible formats to both people and downstream applications.
NLP engines typically have the following technical components:
- API and data storage interfaces to make it easy to connect to data sources and aggregate information for analysis.
- File parsers that extract text, metadata, and other contextual information from different file types and document storage formats.
- Document parsers that break down documents into more atomic units including sections, paragraphs, sentences, phrases, and words.
- Pattern-recognition tools, such as a regular expression parser, that identify dates, currencies, phone numbers, addresses, and similar patterns.
- Dictionaries and other knowledge-storage tools that can help NLP engines identify entities such as names, places, and products.
- Tools and machine-learning algorithms to aid in the creation of domain-specific entities, topics, and terms.
- Semantic and other contextual analysis functions that provide deeper analysis. Is the paragraph positive, negative, or neutral about the subject? Was the paragraph adjacent to a photograph that provides additional context? Was the document found in a folder, or does it link to other documents, that can provide additional context? What is known about the document’s author, and about when the document was written, that can provide additional context?
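To make a few of these components concrete, here is a minimal Python sketch of three of them working together: a naive document parser that splits text into sentences, a regular expression pattern recognizer, and a dictionary-based entity lookup. It is an illustration under simplifying assumptions, not how any particular NLP engine is implemented. The function names, the tiny `ENTITY_DICTIONARY`, the patterns in `PATTERNS`, and the sample text are all hypothetical; production engines rely on far richer parsers, dictionaries, and machine-learning models.

```python
import re

# Hypothetical entity dictionary; a real engine would load a much larger,
# curated knowledge store.
ENTITY_DICTIONARY = {
    "Microsoft Word": "product",
    "Google Docs": "product",
    "Boston": "place",
}

# Illustrative patterns only; real-world dates, currencies, and phone
# numbers come in many more formats than these.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "currency": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def split_sentences(text):
    """Naive document parser: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def find_patterns(sentence):
    """Pattern recognition: return (label, matched_text) pairs."""
    return [
        (label, match.group())
        for label, regex in PATTERNS.items()
        for match in regex.finditer(sentence)
    ]


def find_entities(sentence):
    """Dictionary lookup: return (entity_type, name) pairs found in the sentence."""
    return [
        (kind, name)
        for name, kind in ENTITY_DICTIONARY.items()
        if name in sentence
    ]


def analyze(text):
    """Run each sentence through the pattern and entity steps."""
    return [
        {
            "sentence": sentence,
            "patterns": find_patterns(sentence),
            "entities": find_entities(sentence),
        }
        for sentence in split_sentences(text)
    ]


sample = (
    "The Boston office paid $1,250.00 on 2021-03-15. "
    "Call 555-123-4567 about the Google Docs export."
)

if __name__ == "__main__":
    for row in analyze(sample):
        print(row)
```

Each stage here is deliberately simple so that it can be swapped out independently, for example by replacing the substring lookup in `find_entities` with a trained entity recognizer, which mirrors how the components listed above are layered in practice.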