Three open source NLP tools for extracting data

Developers and data scientists use generative AI and large language models (LLMs) to query large volumes of documents and unstructured data. Open source LLMs such as Dolly 2.0, EleutherAI Pythia, Meta AI LLaMa, and StabilityLM are starting points for experimenting with artificial intelligence that accepts natural language prompts and generates summarized responses.

Fluree CEO and co-founder Brian Platz said: "Text is important as a fundamental source of knowledge and information, yet there are no end-to-end solutions that tame the complexity of processing text. While most organizations wrangle structured or semi-structured data onto centralized data platforms, unstructured data remains forgotten and underutilized."

If your organization and team aren't experimenting with natural language processing (NLP) capabilities, you could be falling behind the competition in your industry. The 2023 Expert NLP Survey report found that 77% of organizations said they plan to increase spending on NLP, and 54% of organizations said that time to production deployment is the top metric for measuring the ROI of successful NLP projects.

Use cases for NLP

If you have large amounts of unstructured data and text, some of the most common business requirements include the following:

  • Entity extraction, identifying names, dates, locations, and products;
  • Pattern recognition, discovering currencies and other quantities;
  • Categorization of business terms, topics, and taxonomies;
  • Sentiment analysis, including positive, negative, and sarcastic sentiment;
  • Summarization of a document's key points;
  • Machine translation into other languages;
  • Dependency graphs that convert text into machine-readable, semi-structured representations.

Sometimes NLP capabilities are bundled into a platform or application: LLMs support question answering, AI search engines support search and recommendation, and chatbots support interaction. Other times, using NLP tools to extract information from and enrich unstructured documents and text is the better option.

Take a look at these three popular open source NLP tools that developers and data scientists use today to perform discovery operations on unstructured documents and develop production-ready NLP processing engines.

1. Natural Language Toolkit

Released in 2001, the Natural Language Toolkit (NLTK) is one of the older and more popular NLP Python libraries. NLTK has more than 11,800 stars on GitHub and lists more than 100 trained models.

Steven Devoe, director of data and analytics at SPR, said: "I think the most impactful tool for NLP is the Natural Language Toolkit (NLTK), which is used under the Apache 2.0 license. In all data science projects, processing and cleaning the data for use by algorithms consumes a lot of time and effort, and natural language processing is no exception. NLTK speeds up many of these tasks, such as stemming, lemmatization, tokenization, removing stop words, and embedding word vectors across a variety of written languages, making it easier for algorithms to interpret the text."

NLTK's strength stems from its longevity, and it provides many examples for developers who are new to NLP, such as its hands-on guide for beginners and its more comprehensive overview. Anyone learning NLP techniques may want to try this library first, as it provides simple ways to experiment with basic techniques like tokenization, stemming, and chunking.
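For anyone getting oriented, here is a minimal sketch of those basics; it assumes the tokenizer models and stop word lists have been fetched via nltk.download (newer NLTK releases may also require the punkt_tab resource):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of tokenizer models and stop word lists
nltk.download("punkt")
nltk.download("stopwords")

text = "NLTK speeds up tokenization, stemming, and stop word removal."

tokens = word_tokenize(text)             # split the text into word tokens
stops = set(stopwords.words("english"))  # common English stop words
content = [t for t in tokens if t.isalpha() and t.lower() not in stops]

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])  # reduce each word to its stem
```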

2. spaCy

spaCy is a newer library, with version 1.0 released in 2016. spaCy supports more than 72 languages, has published its performance benchmarks, and has amassed more than 25,000 stars on GitHub.

Nikolay Manchev, head of data science for the EMEA region at Domino Data Lab, said: "spaCy is a free, open source Python library that provides advanced capabilities for high-speed natural language processing of large volumes of text. With spaCy, users can build models and production-grade applications that support document analysis, chatbot capabilities, and all other forms of text analysis. Today, the spaCy framework is one of Python's most popular natural language libraries for industry use cases such as extracting keywords, entities, and knowledge from text."

The spaCy tutorials show capabilities similar to NLTK's, such as named entity recognition and part-of-speech tagging. One advantage is that spaCy returns document objects and supports word vectors, which gives developers more flexibility to perform additional post-NLP data processing and text analysis.
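Here is a minimal sketch of those two capabilities, assuming the small English pipeline has been installed with python -m spacy download en_core_web_sm:

```python
import spacy

# Load the small English pipeline (assumes it is already installed)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Named entity recognition: each entity carries a text span and a label
for ent in doc.ents:
    print(ent.text, ent.label_)

# Part-of-speech tagging on the same document object
for token in doc:
    print(token.text, token.pos_)
```

Because nlp() returns a full document object, a single pass yields tokens, entities, and vectors that can all feed downstream processing.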

3. Spark NLP

If you already use Apache Spark and have its infrastructure configured, Spark NLP may be one of the easier ways to start experimenting with natural language processing. Spark NLP has several installation options, including AWS, Azure Databricks, and Docker.

David Talby, CTO of John Snow Labs, said: "Spark NLP is a widely used open source natural language processing library that enables companies to extract information and answers from free-text documents with the highest accuracy. That enables everything from extracting relevant health information from clinical records, to identifying hate speech or fake news on social media, to summarizing legal agreements and financial news."

What sets Spark NLP apart is its language models for the medical, financial, and legal domains. These commercial products come with pre-trained models for identifying drug names and dosages in the medical domain, for financial entity recognition such as stock tickers, and for legal knowledge graphs of company names and executives.

Spark NLP can help organizations minimize the upfront training required to develop models, Talby said. "This free and open-source library comes with over 11,000 pre-trained models, plus features for reusing, training, tuning, and easily extending models," he said.
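As a sketch of what getting started can look like, assuming Spark NLP and PySpark are installed (for example, pip install spark-nlp pyspark) and the general-purpose explain_document_dl pipeline can be downloaded:

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session with Spark NLP registered
spark = sparknlp.start()

# Download and load a general-purpose pretrained pipeline
# (tokenization, POS tagging, NER, and more in one pass)
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate("John Snow Labs is based in Delaware.")
print(result["entities"])  # named entities detected in the sentence
```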


Best practices for trying out NLP

Early in my career, I had the privilege of overseeing the development of several SaaS products built with NLP capabilities. The first was a SaaS platform for searching newspaper classified ads, including searches for cars, jobs, and real estate. I then led the development of NLP for extracting information from commercial construction documents, including building specifications and blueprints.

When starting NLP in a new field, my advice is as follows:

  • Start with a small but representative sample of the documents or text.
  • Identify the target end-user personas and how the extracted information improves their workflows.
  • Specify the desired information extraction and target accuracy metrics.
  • Test several approaches, benchmarking them on speed and accuracy metrics (see the sketch after this list).
  • Iteratively improve accuracy, especially as you increase the size and breadth of the document corpus.
  • Plan to deliver data stewardship tools for addressing data quality and handling exceptions.
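
To make the benchmarking step concrete, here is a minimal, hypothetical harness; the labeled_samples data and the extract_entities callables are illustrative assumptions, not part of any particular library:

```python
import time

# Hypothetical labeled data: (text, expected entity strings) pairs
labeled_samples = [
    ("Acme Corp hired Jane Doe in Boston.", {"Acme Corp", "Jane Doe", "Boston"}),
    ("Globex opened an office in Tokyo.", {"Globex", "Tokyo"}),
]

def benchmark(name, extract_entities):
    """Report simple recall and throughput for one candidate approach.

    extract_entities is any callable taking a text string and returning
    a set of entity strings (e.g., a thin wrapper around NLTK or spaCy).
    """
    start = time.perf_counter()
    correct = total = 0
    for text, expected in labeled_samples:
        predicted = extract_entities(text)
        correct += len(predicted & expected)
        total += len(expected)
    elapsed = time.perf_counter() - start
    print(f"{name}: recall={correct / total:.2%}, "
          f"throughput={len(labeled_samples) / elapsed:.1f} docs/sec")

# Example: benchmark a spaCy-based extractor (assumes nlp loaded as earlier)
# benchmark("spaCy", lambda text: {ent.text for ent in nlp(text).ents})
```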

You may find that using NLP tools to discover and experiment with new document types helps define requirements. Then, expand your comparison of NLP techniques to cover both open source and commercial options, since building and supporting a production-ready NLP data pipeline can be costly. With LLMs gaining traction, underinvesting in NLP capabilities is one way to fall behind the competition. Fortunately, you can start with one of the open source tools introduced in this article and build an NLP data pipeline that fits your budget and needs.
