Study notes on "Python Natural Language Processing" by Jalaj Thanaki: 07 Rule-Based Systems for NLP

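Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. You are welcome to add me on WeChat under my real name (Zhou_zhongliang) so that we can learn and improve together. https://blog.csdn.net/weixin_43935926/article/details/86736712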


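The algorithms (implementation techniques or approaches) for NLP applications fall into two groups. This chapter focuses on one of them: rule-based (RB) systems. We will cover the following topics:

  • Understanding the RB system
  • Purpose of having the RB system
  • Architecture of the RB system
  • Understanding the RB system development life cycle
  • Applications
  • Developing NLP applications using the RB system
  • Comparing the RB approach with other approaches
  • Advantages
  • Disadvantages
  • Challenges
  • Recent trends for the RB system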

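7.1 Rule-based systems

RB systems are also known as knowledge-based systems. First, though, what does an RB system mean, and what does it do for us? Which kinds of NLP applications can be built with this approach? To make this easier to follow, I will explain the concepts with the help of applications. A rule-based system is defined in terms of existing knowledge or rules: we develop a system that applies those rules to a corpus and tries to generate or infer results. Figure 7.3 gives an overview of an RB system:
[Figure 7.3: overview of an RB system]
In short, an RB system applies real-life rules or experience to an available corpus, manipulates the information according to those rules, and derives decisions or results. The rules are created by humans. An RB system is used to interpret the available corpus (information) in a useful way, and the rules act as its core logic. Because the corpus is interpreted according to the rules or knowledge, the final result depends on two factors: the rules and the corpus. Next, I will walk through an AI application to convey the core essence of an RB system.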

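As humans, we do very complicated work every day to get things done. To complete a task, we draw on previous experience or follow rules.

For example, when you drive a car you follow certain rules, and you know them in advance. Now think about a self-driving car: it should react to, or carry out, every task the human driver used to perform. But a car does not know how to drive itself without a driver. Developing such a driverless car is both complicated and challenging.

Suppose you want to build one anyway. You know there are a great many rules the car must learn in order to perform like a human driver. The main challenges are:

  • This is a complex application
  • The car needs to learn many rules and situations
  • The self-driving car must be accurate enough to be released to consumers

To address these challenges, we take the following steps:
1. We first reduce the problem statement to small parts, each a subset of the original problem.
2. We solve one small part of the problem first.
3. To solve it, we try to come up with a general rule that solves that part and also moves us toward the final goal.

For our version of the driverless (self-driving) car, we need to think from the software perspective. So what is the first thing the car should learn? Think about it!

The car should learn to see and recognize objects on the road. That is the first step for our car: we define some general rules that it uses to learn and to decide whether there is an object on the road, and then to drive on that basis. What speed should it choose given the road conditions it sees? And so on.

For each small part of the task, we try to define rules and feed that rule logic into the RB system. Then we check whether the rule works correctly on the given input data. After obtaining the output, we also measure the system's performance.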

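Now you may be thinking: this is a book about NLP, so why am I giving an example from a general AI application? The reason is that the self-driving car example is easy to relate to and everyone can understand it. I want to highlight a few points that also help us understand the purpose of building a rule-based system.

Let us take this general example and understand its purpose:

  • The self-driving car example helps you recognize that tasks humans find easy are sometimes far more complicated for a machine to do on its own.
  • These complex tasks need high accuracy. I mean very high!
  • We do not expect our system to cover and learn every situation, but whatever rules we feed into it, it should handle those situations as well as possible.
  • In an RB system, coverage of scenarios is low, but the accuracy of the system is high. That is what we need.
  • Our rules are derived from real-life human experience or from human knowledge.
  • Rules are formulated and implemented by humans.

All these points help us decide when and where to use a rule-based system, and that leads us to define the purpose of having one. So let us move to the next section, where we define a rule of thumb for approaching any NLP or related application with a rule-based approach.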

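7.2 Purpose of having the rule-based system

Rule-based systems are commonly used to develop both NLP applications and general AI applications. There are several questions we need to answer to get a clear picture of rule-based systems.

7.2.1 Why do we need the rule-based system?

A rule-based system tries to mimic human expert knowledge for an NLP application. The following factors will help you understand the purpose of an RB system:

  • The available corpus is small

  • The output is too subjective

  • It is easy for people in a specific domain to generate some specialized rules

  • It is hard for a machine to generate specialized rules by observing only a small amount of data

  • The system output should be highly accurate

If you want to develop an NLP application using an RB system, all of the preceding factors are critical. How do they help you decide whether to choose the RB approach? You need to ask the following questions:

  • Do you have a large amount of data or a small amount? If you have a small amount, go on to the next question; if you have a large amount, you have many other options.

  • Is the output of the NLP application you want to develop subjective or generalized?
    If you have a small amount of data and the output of the application is too subjective, and you know that the machine cannot generalize patterns from so little data, then choose an RB system.

  • Should the NLP application you want to develop be very highly accurate?
    If the application should reach high accuracy, almost the same as a human, using a small dataset, then choose an RB system.
    Keep in mind that human experts create the rules and the system generates its output according to them, so an RB system is highly accurate but does not cover all scenarios.

The preceding questions define why, and in which cases, we can use an RB system. If I had to summarize them, I would put it like this: if you have a small amount of data, you know you need a highly accurate system, it is easy for a human expert to identify the various scenarios and write rules and their outputs, but it is very hard for a machine to identify generalized rules accurately on its own, then the RB system is for you! The output of an RB system should mimic the experience of the human expert. This is the rule of thumb for choosing an RB system.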
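The rule of thumb above can itself be written down as a tiny set of rules. The following is only a sketch: the `ProjectProfile` fields and the `suggest_approach` function are names I made up for illustration, and the yes/no answers are a simplification of the questions in this section.

```python
from dataclasses import dataclass


@dataclass
class ProjectProfile:
    """Hypothetical summary of an NLP project (field names are illustrative)."""
    small_corpus: bool         # only a small amount of data is available
    subjective_output: bool    # the expected output is highly subjective
    needs_high_accuracy: bool  # output must be close to human-expert quality
    expert_available: bool     # a domain expert can write the rules


def suggest_approach(profile: ProjectProfile) -> str:
    """Encode the rule of thumb from the text as explicit if/else rules."""
    if not profile.small_corpus:
        # With plenty of data, many other options are open.
        return "other approaches"
    if not profile.expert_available:
        # Rules are written by humans, so no expert means no rules.
        return "other approaches"
    if profile.subjective_output or profile.needs_high_accuracy:
        return "rule-based"
    return "other approaches"


print(suggest_approach(ProjectProfile(True, True, True, True)))
```

Writing the decision this way also shows the trade-off described earlier: the function is precise for the cases it names and says nothing useful about anything else.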

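7.2.2 Applications that use the rule-based system

As defined earlier, an RB system is developed with the help of human domain experts. Let us take a couple of examples in this section to test our rule of thumb.

Say we want to build a machine translation system from English to an Indian language (Gujarati, in this example), but the available corpora are too small. The translation system should be accurate enough to be worth developing. We need human experts who know both Gujarati and English. We do not want to handle all the different levels of translation at once, so we first solve a small part of the problem and then build the other parts on top of the resulting prototype. So here, too, I would choose an RB system. What do you think?

Now suppose we want to develop a grammar correction system for English. Suppose we have a small parallel corpus (documents with grammatical mistakes and the same documents without them), and using this corpus we need to build an accurate grammar correction application that identifies and corrects grammatical mistakes. Which approach would you take for this kind of application? Think for a while and come up with your answer! Here, I would go with an RB system, following our rule of thumb.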

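7.2.3 Exercise

If you wanted to develop a basic chatbot system, which approach would you take?
1. RB approach
2. ML approach

If you wanted to predict the sentiment of a given sentence, which approach would you take?
1. RB approach
2. ML approach
3. Hybrid approach
4. None of them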

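7.2.4 Resources needed to develop a rule-based system

Now you understand why we use an RB system and for which kinds of applications. The third important aspect is: what do we need if we want to develop an RB system for an NLP or AI application?

There are three main resources to consider at this point; see Figure 7.4. Let us look at each resource in detail, since together they help us define the components of an RB system:

Domain expert (human expert/knowledge expert): To develop an application using an RB system, we first need a domain expert, someone who knows almost everything about the domain.
Suppose you want to build a machine translation system; then your domain expert could be a person with deep linguistic knowledge of both the source and the target language, who can draw on that expertise and experience to write the rules.

System architect (system engineer) of the RB system: To define the architecture of the RB system, you need a team or a person with the following expertise:
  • Basic knowledge of the domain
  • In-depth knowledge of, or experience in, designing system architectures
The architecture is the most important part of an RB system because it is one of the components that determine how efficient the whole system will be. A good architecture design gives a good user experience and accurate, efficient output; it also makes life easier for the coders and for other technical teams, such as support or testing, who will be able to work on the system easily. The system architecture is the responsibility of the system engineer or system architect.

Coders (developers or knowledge engineers) who implement the rules: Once the domain expert has developed the rules and the system architecture has been properly designed, the coders or developers come into the picture. Coders are our real ninjas! They implement the rules in a programming language and help complete the application. Their coding skills are a much-needed part of an RB system. The programming can be done in any programming or scripting language, such as C, C++, Java, Python, Perl, or shell scripts. You can use any of them as the architecture requires, but not all of them in a single system without a streamlined architecture.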

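7.3 Architecture of the rule-based system

I will explain the architecture of the RB system by dividing it into the following parts:

  • General architecture of the RB system as an expert system

  • Practical architecture of the RB system for NLP applications

  • Custom architecture of the RB system for NLP applications

  • Apache UIMA (Unstructured Information Management Architecture) RB architecture for NLP applications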

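7.3.1 General architecture of the rule-based system as an expert system

If we describe the rule-based system as an expert system, its architecture looks like the one shown in Figure 7.5:
[Figure 7.5: general architecture of the RB system as an expert system]
Let us look at each component of the architecture in detail:

Domain expert:
As we saw in the previous section, domain experts are people with specialized knowledge of a particular domain who can help us generate the rules that solve our problem.

Developers or knowledge engineers:
  • Developers take the rules created by the domain expert and use their coding skills to convert them into a machine-understandable format
  • Developers code the rules that the experts create
  • Most of the time, this coding is done in the form of pseudocode

Knowledge base:
  • The knowledge base is where the experts put the rules
  • Domain experts can add, update, or delete rules

Database or working storage:
  • All meta-information related to the rules can be kept in working storage
  • Here we can store the rules as well as special scenarios, lists (if any), examples, and so on
  • We also keep the data to which the rules are applied

Inference engine:
  • The inference engine is the core part of the system
  • This is where the actual code for our rules goes
  • A rule is triggered when its predefined conditions are met by the user query or by the dataset we give the system as input

User inference:
Sometimes our end users also supply conditions to narrow down the results, and these are taken into account when the system generates its output.

User interface:
The user interface lets our users submit their input and receive the output in return. It gives our end users an interactive environment.

System architect:
  • The system architect is responsible for the whole architecture of the system
  • The system architect also decides which architecture the RB system uses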
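To make the roles of the knowledge base, the working storage, and the inference engine more concrete, here is a minimal sketch. Everything in it (the `Rule` and `KnowledgeBase` classes, the `infer` function, and the two driving rules) is invented for illustration and borrows the self-driving car example from earlier in the chapter; a real expert system would be far richer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]  # when does the rule fire?
    action: Callable[[Dict], str]      # what does it produce?


@dataclass
class KnowledgeBase:
    """Where the expert's rules live; rules can be added or removed."""
    rules: List[Rule] = field(default_factory=list)

    def add(self, rule: Rule) -> None:
        self.rules.append(rule)

    def remove(self, name: str) -> None:
        self.rules = [r for r in self.rules if r.name != name]


def infer(kb: KnowledgeBase, working_memory: Dict) -> List[str]:
    """Inference engine: fire every rule whose condition matches the data."""
    return [rule.action(working_memory)
            for rule in kb.rules
            if rule.condition(working_memory)]


kb = KnowledgeBase()
kb.add(Rule("slow_down",
            condition=lambda wm: wm["object_ahead"] and wm["speed_kmh"] > 30,
            action=lambda wm: "reduce speed"))
kb.add(Rule("keep_going",
            condition=lambda wm: not wm["object_ahead"],
            action=lambda wm: "maintain speed"))

# Working storage holds the data the rules are applied to.
working_memory = {"object_ahead": True, "speed_kmh": 50}
decisions = infer(kb, working_memory)
print(decisions)
```

Keeping the rules in a separate container from the engine is what lets a domain expert add, update, or delete rules without touching the code that applies them.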

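7.3.2 Practical architecture of the rule-based system for NLP applications

Now we will look at the practical architecture of an RB system for NLP applications. See Figure 7.6:
[Figure 7.6: practical architecture of the RB system for NLP applications]
We have already seen some parts in the previous section, such as the domain expert, the user interface, and the system engineer. So here we focus on the new components:

Knowledge-based editor:
The domain expert may not know how to code, so we give them a knowledge-based editor in which they can write or create rules in human language. Suppose we are developing a grammar correction system for English and we have a linguist who knows how to create the rules but not how to code them. In that case, the linguist can add, update, or delete rules through the knowledge-based editor. All rules created this way are specified in ordinary human language.

Rule translator:
Since all the rules are in human language, we need to translate or convert them into a machine-understandable form. The rule translator is the part where the pseudo-logic of each rule is defined, together with examples. Consider our grammar correction system. Our expert defines the rule: if a sentence has a singular subject and a plural verb, change the verb to its singular form. In the rule translator, this rule is converted to: if sentence S has a singular subject with the POS tags PRP$, NP and a verb with the POS tag VBP, then change the verb to its VBZ form. Some examples are also specified to make the rule easier to understand.

Rule object classes:
The rule object classes act as a container for supporting libraries. They contain the various prerequisite libraries, and sometimes also an optional object class for libraries that optimize the whole system. For the grammar correction system, we can put tools such as a parser, a POS tagger, and named entity recognition (NER) in the container for the rule engine to use.

Database or knowledge base:
The database holds metadata for the rules, such as: Which supporting libraries from the rule object classes does a rule use? What is the category of the rule? What is the priority of the rule?

Rule engine:
This is the core part, the brain of the RB system. Using the rule translator, the rule object classes, and the knowledge base, we develop the core code that actually runs on the user query or the input dataset and generates the output. You can code in whichever programming language best suits your application and its architecture. For our grammar correction system, we code the rules at this stage, and the final code goes into the rule engine repository.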
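As a rough sketch of how the subject-verb agreement rule from the rule translator could end up in a rule engine, consider the code below. It assumes the sentence has already been POS-tagged (the tokens are tagged by hand here), the tags I check for the subject (PRP for he/she/it, NN, NNP) are my own assumption rather than the exact tag list quoted above, and `to_third_person` is a deliberately naive helper. The `RULES` list with its category and priority fields is a hypothetical stand-in for the rule metadata kept in the database.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, Penn Treebank POS tag), assumed pre-tagged

SINGULAR_PRONOUNS = {"he", "she", "it"}


def to_third_person(verb: str) -> str:
    """Naive VBP -> VBZ conversion; irregular verbs need a lookup table."""
    irregular = {"have": "has", "do": "does", "go": "goes", "are": "is"}
    if verb in irregular:
        return irregular[verb]
    if verb.endswith(("s", "sh", "ch", "x", "z")):
        return verb + "es"
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in "aeiou":
        return verb[:-1] + "ies"
    return verb + "s"


def subject_verb_agreement(tokens: List[Token]) -> List[Token]:
    """If a singular subject is directly followed by a VBP verb, use VBZ."""
    fixed = list(tokens)
    for i in range(len(fixed) - 1):
        word, tag = fixed[i]
        next_word, next_tag = fixed[i + 1]
        singular = (tag in {"NN", "NNP"}
                    or (tag == "PRP" and word.lower() in SINGULAR_PRONOUNS))
        if singular and next_tag == "VBP":
            fixed[i + 1] = (to_third_person(next_word), "VBZ")
    return fixed


# Rule metadata (category, priority) mirrors what the text stores in the database.
RULES = [
    {"name": "subject_verb_agreement", "category": "grammar",
     "priority": 1, "apply": subject_verb_agreement},
]


def run_rule_engine(tokens: List[Token]) -> List[Token]:
    """Apply all rules in priority order (lower number runs first)."""
    for rule in sorted(RULES, key=lambda r: r["priority"]):
        tokens = rule["apply"](tokens)
    return tokens


tagged = [("He", "PRP"), ("like", "VBP"), ("apples", "NNS")]
corrected = run_rule_engine(tagged)
print(" ".join(word for word, _ in corrected))
```

In a fuller system the hand-written tags would come from a POS tagger held in the rule object classes, and the rule would also need to handle subjects that are not directly next to the verb.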

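7.3.3 Custom architecture of the rule-based system for NLP applications

Depending on the needs of different NLP applications, you can change the architecture or its components; customization is possible in this approach. If you are designing a customized RB system architecture, there are some points to keep in mind. Ask the following questions:

Have you analyzed and studied the problem and the architectures that already exist?
Before customizing, you need to analyze your application. If an existing system is available, spend enough time studying its architecture and note both its good and its bad points.

Do you really need a custom architecture?
If, after this research, you feel that your application's architecture needs customization, write down the reasons why you really need it. Then ask a series of questions to check whether the reasons you listed really help make the system better. If they do, you are on the right track.

Does it help streamline the development process?
Does the new architecture really improve your development process? If so, you can consider it.
Most of the time, defining a streamlined process for developing an RB system is challenging, so if your new customized architecture helps with that, it is a real plus. Does the streamlined process actually stabilize your RB system?

Is it maintainable?
If you can add this property to your customized architecture, thumbs up: a customized architecture can help you maintain the system easily and efficiently.

Is it modular?
If it brings modularity to the RB system, that is very useful, because you can then add, remove, or update individual modules easily.

Is it scalable?
With the help of the new architecture, you should be able to scale the system. You should consider this as well.

Is it easy to migrate?
With a well-defined architecture, it should be easy for the team to migrate the system from one platform to another. If we want to move a module from one system to another, that should be easy for both the technical team and the infrastructure team.

Is it secure?
System security is a major concern. The new architecture should provide security and user privacy where needed.

Is it easy to deploy?
If you want to deploy changes in the future, deployment should be easy.
If you want to sell your final product, the deployment process should be simple enough to reduce your effort and time.

Does it save development time?
Implementing and developing the RB system with this architecture should save time.
The architecture itself should not take too much time to implement.

Is it easy for our users to use?
The architecture may be complex, but for end users the system must be user-friendly and easy to use.

If you can adopt all or most of the preceding points, try implementing a small set of problems using the architecture you think suits the system best, and then, at the end, ask all the preceding questions again and evaluate the output.

If you still get positive answers, you are good to go! Here, a design is neither right nor wrong; what matters is that it is the best fit for your NLP application.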
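A question-answering (Q/A) system can use an architecture like the one shown in the figure:

[Figure: architecture of the Q/A system]
Here you can see a quite different architecture. The approach of this Q/A system is an ontology-based RB system. Question processing and document processing are our main rule engines. We are not considering a high-level Q/A system here: we want to develop a Q/A system for small children, who can ask questions about stories, and the system returns answers according to the rules and the available story data.

Let us look at each component in detail:
  • When the user submits a question, the parser parses it.
  • The interpreter matches the parsed question against the knowledge base, the ontology, and a keyword thesaurus.
  • Here we also apply reasoning and facts.
  • We derive some facts from the question and categorize the user's question using query classification and reformulation.
  • The generated facts and the categorized query are then sent to the document processing part, where the facts are given to the search engine.
  • Answer extraction is the core RB engine of the Q/A system, because it uses the facts and applies reasoning techniques, such as forward chaining or backward chaining, to extract all possible answers. Here is a brief overview of the two. In forward chaining, we start with the available data and use inference rules to extract more facts from the data until a goal is reached; expert systems use this technique to work out what can happen next. In backward chaining, we start with a list of goals and work backwards to find out which conditions could have led to the current result; this helps us understand why something happened.
  • Once all possible answers have been generated, they are sent back to the user.

I want to ask you a question: if you were developing a Q/A system, which kind of database would you choose? Think before you read on!

I would choose a NoSQL database over a SQL database, and there are a couple of reasons behind this. The system should be available to users 24/7. We care about our users here: they can access the system at any time, so availability is the key requirement. That is why I would go with a NoSQL database. Also, if we later want to run analytics on users' questions and answers, we need to save the users' questions and the system's answers in the database.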
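Returning to the answer-extraction step described above, forward and backward chaining can be sketched in a few lines. This is a toy illustration only: the facts and rules are made up for a children's story, each rule is simply a set of premises plus one conclusion, and `backward_chain` assumes the rules contain no cycles.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (premises, conclusion): if all premises are known facts,
# the conclusion becomes a fact too. The story facts are made up.
Rule = Tuple[FrozenSet[str], str]

RULES: List[Rule] = [
    (frozenset({"is_a_wolf", "is_hungry"}), "wants_food"),
    (frozenset({"wants_food", "sees_pigs"}), "chases_pigs"),
]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Start from the data and keep deriving new facts until nothing changes."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def backward_chain(goal: str, facts: Set[str], rules: List[Rule]) -> bool:
    """Start from the goal and work back to facts that would support it.

    Assumes the rule set is acyclic; cyclic rules would recurse forever.
    """
    if goal in facts:
        return True
    for premises, conclusion in rules:
        if conclusion == goal and all(
                backward_chain(p, facts, rules) for p in premises):
            return True
    return False


facts = {"is_a_wolf", "is_hungry", "sees_pigs"}
derived = forward_chain(facts, RULES)
print(sorted(derived - facts))
print(backward_chain("chases_pigs", facts, RULES))
```

Forward chaining answers "what follows from what we know?", while backward chaining answers "is there support for this particular conclusion?", which matches the "what happens next" versus "why did this happen" distinction in the text.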

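7.3.4 Exercise

Suppose you are developing a grammar correction system. What kind of system architecture would you design? Try to design it on paper, and put your thoughts down!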

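7.3.5 Apache UIMA architecture

Apache UIMA was originally developed by IBM to process unstructured data.
The features of UIMA are as follows:
  • UIMA provides us with the infrastructure, components, and framework
  • UIMA has a built-in RB engine and the GATE library for preprocessing text data
  • The following tools are available as part of its components:
      • Language identification tool
      • Sentence segmentation tool
      • NER tool
  • We can code in Java, Ruta, and C++. It is a flexible, modular, and easy-to-use framework, and the C/C++ annotators also support Python and Perl

Applications and uses of UIMA include:
  • IBM Watson uses UIMA to analyze unstructured data
  • The clinical Text Analysis and Knowledge Extraction System (Apache cTAKES) uses a UIMA-based system to extract information from medical records

The challenges of using UIMA include:
  • You need to code the rules in Java, Ruta, or C++. Although many RB systems use C++ for optimization, finding good people who know Ruta is a challenging task
  • If you are new to UIMA, you need some time to become familiar with it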

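7.4 Development life cycle of the rule-based system

In this section, we look at the development life cycle of an RB system, which will help you in the future if you want to develop your own. Figure 7.8 describes the life cycle; the figure is self-explanatory, so it needs no extra description.

If we follow the stages of the RB development life cycle, life will be easier for us:
[Figure 7.8: development life cycle of the RB system]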

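7.5 Applications of the rule-based system

In this section, I divide the applications into two groups: NLP applications and generalized AI applications.

7.5.1 NLP applications using the rule-based system

Sentence boundary detection:

Sentence boundary detection is easy for general English writing, but it gets complicated when you deal with research papers or other scientific documents. Handcrafted post-processing rules therefore help us identify sentence boundaries accurately. Grammarly Inc. has used this approach in its grammar correction system.

Machine translation:

When we think of a machine translation system, Google's Neural Machine Translation (GNMT) system comes to mind. For many Indian languages, Google uses a complex rule-based system together with a statistical prediction system, so it has a hybrid system. In 2016, Google launched its neural network-based machine translation system. Many research projects still use RB systems for machine translation, most of them targeting languages that have not yet been explored.

Template-based chatbots:

Chatbots are the new trend and the current craze in the market. Their basic version follows a template-based approach, in which a set of questions or keywords is defined and an answer is mapped to each keyword.
The good part of this system is the keyword matching: even if you write in another language, as long as your chat message contains a keyword we have defined, the system can send you an appropriate message in response.
The bad part is that if you make a spelling mistake, the system cannot respond properly. We will develop this application from scratch. I will explain the coding part in the next section, so keep reading and start your computer!

Grammar correction system:

RB systems are also used for grammar correction. In this application, we can provide anything from simple rules to very complex ones. In the next section, we will see some basic grammar correction rules.

Question-answering systems:

Question-answering systems also use RB systems, but with one difference. A Q/A system uses semantics to find the answer to a submitted question, and to bring semantics into the picture we use the ontology-based RB approach.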
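Going back to the first application in this list, the idea of handcrafted post-processing rules for sentence boundary detection can be sketched as follows. This is a minimal illustration under my own assumptions: the abbreviation list is a tiny hypothetical sample, and the function names are made up; it is not a description of how any particular product works.

```python
import re
from typing import List

# Abbreviations that end with a period but usually do not end a sentence.
# This list is a small hypothetical sample, not an exhaustive resource.
NON_TERMINAL_ABBREVIATIONS = {"e.g.", "i.e.", "al.", "fig.", "eq.", "dr.", "vs."}


def naive_split(text: str) -> List[str]:
    """First pass: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_false_boundaries(sentences: List[str]) -> List[str]:
    """Post-processing rule: rejoin a piece that ends with a known abbreviation."""
    merged: List[str] = []
    for sentence in sentences:
        if merged and merged[-1].split()[-1].lower() in NON_TERMINAL_ABBREVIATIONS:
            merged[-1] = merged[-1] + " " + sentence
        else:
            merged.append(sentence)
    return merged


def split_sentences(text: str) -> List[str]:
    return merge_false_boundaries(naive_split(text))


text = ("Results are shown in Fig. 2 and follow Smith et al. closely. "
        "We discuss them below.")
for sentence in split_sentences(text):
    print(sentence)
```

The rule is precise for the abbreviations it knows and blind to everything else; for example, a sentence that genuinely ends in "et al." would be merged with the next one, which is exactly the kind of exception a real rule set has to handle.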

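7.5.2 Generalized AI applications using the rule-based system

You have seen the NLP applications that use the RB approach. Now we move on to generalized AI applications, which use the RB approach along with other techniques:

Self-driving or driverless cars:

At the beginning of the chapter, I used the self-driving car example to highlight the purpose of having an RB system. Self-driving cars also use a hybrid approach. Many big companies, from Google to Tesla, are trying to build self-driving cars, and their experiments aim at developing the most trustworthy ones. In the early days, this application was developed with complex RB systems. The experiments then moved toward ML techniques, and nowadays companies are applying deep learning techniques to make the systems better.

Robotics applications:

It has been a long-term goal of the AI community to develop robots that complement human skills. The goal is robots that help humans with their work, mostly tasks that are time-consuming. Suppose there is a robot that helps you with household chores. A robot can perform this kind of task with the help of rules for all possible situations.

Expert systems at NASA:

NASA developed expert systems using CLIPS (C Language Integrated Production System), a general-purpose programming language.

I think that is enough theory. Now we should try to develop some RB applications from scratch. Get ready to code; we begin our coding journey in the next section.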

7.6 Developing NLP applications using the rule-based system

pip install nltk
pip install beautifulsoup4==4.6.0
pip install requests==2.18.1
pip install pycorenlp
pip install Flask==0.12.2
pip install Flask-Cors==3.0.2
pip install Flask-PyMongo==0.5.1
pip install pytz==2017.2

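Installing MongoDB
1. Download MongoDB from the official site: http://www.mongodb.org/downloads The latest version at the time of writing was mongodb-win32-x86_64-2008plus-ssl-4.0.5-signed.msi
2. Run the installer directly; it offers both a default installation and an installation to a custom path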

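7.6.1 Thinking process for making rules

We have talked a lot about rules, but how are they actually derived? What is a linguist's thinking process when deriving rules for an NLP application? Let us begin with this thinking process.

You need to think like a linguist for a while. Recall all the concepts you have learned so far in this book and be a linguist.

Suppose you are developing rules for a grammar correction system, specifically for English. What follows is the linguist's thinking process, which can help you write rules:

  • What do I need to know?

    You should know the grammatical rules of the language you are writing rules for; here, that language is English
    You should know the structure, the word order, and other language-related concepts
    These two points are prerequisites

  • Where do I start?

    If you know everything related to the language, you then need to observe and study incorrect sentences
    As you study incorrect sentences, you need to identify what the mistakes in them are
    After that, think about the category of each mistake: is it syntax-related, or is it caused by semantic ambiguity?
    After all this, you can map your language-related knowledge to the mistakes in the sentences

  • How can rules be derived?

    Once you find the mistakes in a sentence, focus on your own thinking process: what does your brain do when it spots a mistake?
    Think about how your brain reacts to each mistake you have identified
    You can catch the mistakes because you know the grammatical facts of the language and other language-related technicalities (sentence syntax structure, semantic knowledge, and so on); your brain is actually helping you
    Your brain knows the right way to interpret the given text in the given language
    That is why you can catch the mistakes. At the same time, you have solid reasons on the basis of which you identified them
    Once you have identified the mistakes, you can correct them, according to their category, by changing parts of the sentence using certain logical rules
    You can change the word order, fix the subject-verb agreement, or change some phrases or all of them together
    Bingo! At this point, you have your rule. You know what the mistake is, and you know the steps that turn the incorrect sentence into a correct one
    Your rule logic is nothing but these steps for converting an incorrect sentence into a correct one

  • What elements do I need to take care of?
    First, think of a very simple way to correct the mistake or the incorrect sentence
    Try to write pattern-based rules
    If you cannot derive a pattern-based rule, check whether you can use the results of a parser and/or a morphological analyzer, and then check other tools and libraries
    By the way, there is a catch here. When you define a rule, you also need to think about how feasible the rule logic is to implement
    Are the tools available? If they are, you can code the rules yourself, or the developers can code them
    If the tools are not available, you need to discard the rule
    Research is needed: when you define a rule, check whether any tools exist that coders can use to implement its logic
    The selected tools should be able to code the exceptions to the rule
    If you have linguists on your team, defining rules and researching tools can be basic tasks for them. If not, then you, as the coder, need to search for tools you can use to code the rule logic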
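To show what a pattern-based rule with exceptions might look like in practice, here is a small sketch for one well-known English mistake: using "a" where "an" is needed. The rule, the regular expression, and the exception list are my own illustration of the thinking process above, not rules taken from the book, and the exception list is far from complete.

```python
import re

# Pattern: the article "a" followed by a word that starts with a vowel letter.
# Exceptions are words that start with a vowel letter but a consonant sound.
# Both the pattern and the exception list are illustrative assumptions.
A_BEFORE_VOWEL = re.compile(r"\b([Aa])\s+([aeiouAEIOU]\w*)")
CONSONANT_SOUND_EXCEPTIONS = {"university", "user", "one", "european", "unit"}


def fix_indefinite_article(sentence: str) -> str:
    """Rewrite 'a apple' as 'an apple', skipping the listed exceptions."""
    def replace(match):
        article, word = match.group(1), match.group(2)
        if word.lower() in CONSONANT_SOUND_EXCEPTIONS:
            return match.group(0)  # leave "a university" untouched
        return f"{article}n {word}"
    return A_BEFORE_VOWEL.sub(replace, sentence)


print(fix_indefinite_article("She ate a apple at a university."))
```

Only the standard `re` module is needed here, so the feasibility check is trivial; a rule based on pronunciation rather than spelling would need a different tool, which is the kind of research the last points above describe.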

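7.6.2 Start with simple rules

The following script scrapes the Wikipedia page titled Programming language: https://en.wikipedia.org/wiki/Programming_language
Our goal is to extract the names of programming languages from the text of the page. For example, the page mentions C, C++, Java, JavaScript, and other programming languages, and I want to extract them. These words can be part of a sentence or stand alone in the text.

Our task can be divided into three parts:
  • Scraping the text data
  • Defining the rules for our goal
  • Coding our rules and generating a prototype and the result

Scraping the text data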

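from bs4 import BeautifulSoup
import requests


def scrapdata():
    """Download the Wikipedia page and save every paragraph to a file."""
    url = 'https://en.wikipedia.org/wiki/Programming_language'
    content = requests.get(url).content
    # The 'lxml' parser requires the lxml package to be installed
    soup = BeautifulSoup(content, 'lxml')
    tag = soup.find('div', {'class': 'mw-content-ltr'})
    paragraphs = tag.find_all('p')
    for para in paragraphs:
        paraexport = para.text.encode('utf-8')
        print(paraexport)
        savedatainfile(paraexport)


def savedatainfile(filecontent):
    """Append one paragraph (UTF-8 bytes) to simpleruledata.txt."""
    # Open in binary append mode so the bytes are written as-is;
    # wrapping them in str() would store the literal b'...' representation
    with open("simpleruledata.txt", "ab") as file:
        file.write(filecontent.strip() + b"\n")


scrapdata()

Output of the script (each paragraph is printed as a bytes literal):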
b'\n'
b'A programming language is a formal language, which comprises a set of instructions used to produce various kinds of output. Programming languages are used in computer programming to create programs that implement specific algorithms.\n'
b"Most programming languages consist of instructions for computers, although there are programmable machines that use a limited set of specific instructions, rather than the general programming languages of modern computers. Early ones preceded the invention of the digital computer, the first probably being the automatic flute player described in the 9th century by the brothers Musa in Baghdad, during the Islamic Golden Age.[1] From the early 1800s, programs were used to direct the behavior of machines such as Jacquard looms, music boxes and player pianos.[2] However, their programs (such as a player piano's scrolls) could not produce different behavior in response to some input or condition.\n"
b'Thousands of different programming languages have been created, mainly in the computer field, and many more still are being created every year. Many programming languages require computation to be specified in an imperative form (i.e., as a sequence of operations to perform) while other languages use other forms of program specification such as the declarative form (i.e. the desired result is specified, not how to achieve it).\n'
b'The description of a programming language is usually split into the two components of syntax (form) and semantics (meaning). Some languages are defined by a specification document (for example, the C programming language is specified by an ISO Standard) while other languages (such as Perl) have a dominant implementation that is treated as a reference. Some languages have both, with the basic language defined by a standard and extensions taken from the dominant implementation being common.\n'
b'A programming language is a notation for writing programs, which are specifications of a computation or algorithm.[3] Some, but not all, authors restrict the term "programming language" to those languages that can express all possible algorithms.[3][4] Traits often considered important for what constitutes a programming language include:\n'
b'Markup languages like XML, HTML, or troff, which define structured data, are not usually considered programming languages.[13][14][15] Programming languages may, however, share the syntax with markup languages if a computational semantics is defined. XSLT, for example, is a Turing complete language entirely using XML syntax.[16][17][18] Moreover, LaTeX, which is mostly used for structuring documents, also contains a Turing complete subset.[19][20]\n'
b'The term computer language is sometimes used interchangeably with programming language.[21] However, the usage of both terms varies among authors, including the exact scope of each. One usage describes programming languages as a subset of computer languages.[22] In this vein, languages used in computing that have a different goal than expressing computer programs are generically designated computer languages. For instance, markup languages are sometimes referred to as computer languages to emphasize that they are not meant to be used for programming.[23]\n'
b'Another usage regards programming languages as theoretical constructs for programming abstract machines, and computer languages as the subset thereof that runs on physical computers, which have finite hardware resources.[24] John C. Reynolds emphasizes that formal specification languages are just as much programming languages as are the languages intended for execution. He also argues that textual and even graphical input formats that affect the behavior of a computer are programming languages, despite the fact they are commonly not Turing-complete, and remarks that ignorance of programming language concepts is the reason for many flaws in input formats.[25]\n'
b'Very early computers, such as Colossus, were programmed without the help of a stored program, by modifying their circuitry or setting banks of physical controls.\n'
b'Slightly later, programs could be written in machine language, where the programmer writes each instruction in a numeric form the hardware can execute directly. For example, the instruction to add the value in two memory location might consist of 3 numbers: an "opcode" that selects the "add" operation, and two memory locations. The programs, in decimal or binary form, were read in from punched cards or paper tape or magnetic tape or toggled in on switches on the front panel of the computer.  Machine languages were later termed first-generation programming languages (1GL).\n'
b'The next step was development of so-called second-generation programming languages (2GL) or assembly languages, which were still closely tied to the instruction set architecture of the specific computer. These served to make the program much more human-readable and relieved the programmer of tedious and error-prone address calculations.\n'
b'The first high-level programming languages, or third-generation programming languages (3GL), were written in the 1950s. An early high-level programming language to be designed for a computer was Plankalk\xc3\xbcl, developed for the German Z3 by Konrad Zuse between 1943 and 1945. However, it was not implemented until 1998 and 2000.[26]\n'
b"John Mauchly's Short Code, proposed in 1949, was one of the first high-level languages ever developed for an electronic computer.[27] Unlike machine code, Short Code statements represented mathematical expressions in understandable form. However, the program had to be translated into machine code every time it ran, making the process much slower than running the equivalent machine code.\n"
b'At the University of Manchester, Alick Glennie developed Autocode in the early 1950s. A programming language, it used a compiler to automatically convert the language into machine code. The first code and compiler was developed in 1952 for the Mark 1 computer at the University of Manchester and is considered to be the first compiled high-level programming language.[28][29]\n'
b'The second autocode was developed for the Mark 1 by R. A. Brooker in 1954 and was called the "Mark 1 Autocode". Brooker also developed an autocode for the Ferranti Mercury in the 1950s in conjunction with the University of Manchester. The version for the EDSAC 2 was devised by D. F. Hartley of  University of Cambridge Mathematical Laboratory in 1961. Known as EDSAC 2 Autocode, it was a straight development from Mercury Autocode adapted for local circumstances and was noted for its object code optimisation and source-language diagnostics which were advanced for the time. A contemporary but separate thread of development, Atlas Autocode was developed for the University of Manchester Atlas 1 machine.\n'
b"In 1954, FORTRAN was invented at IBM by John Backus. It was the first widely used high-level general purpose programming language to have a functional implementation, as opposed to just a design on paper.[30][31] It is still a popular language for high-performance computing[32] and is used for programs that benchmark and rank the world's fastest supercomputers.[33]\n"
b'Another early programming language was devised by Grace Hopper in the US, called FLOW-MATIC. It was developed for the UNIVAC I at Remington Rand during the period from 1955 until 1959. Hopper found that business data processing customers were uncomfortable with mathematical notation, and in early 1955, she and her team wrote a specification for an English programming language and implemented a prototype.[34] The FLOW-MATIC compiler became publicly available in early 1958 and was substantially complete in 1959.[35] FLOW-MATIC was a major influence in the design of COBOL, since only it and its direct descendant AIMACO were in actual use at the time.[36]\n'
b'The increased use of high-level languages introduced a requirement for low-level programming languages or system programming languages. These languages, to varying degrees, provide facilities between assembly languages and high-level languages and can be used to perform tasks which require direct access to hardware facilities but still provide higher-level control structures and error-checking.\n'
b'The period from the 1960s to the late 1970s brought the development of the major language paradigms now in use:\n'
b'Each of these languages spawned descendants, and most modern programming languages count at least one of them in their ancestry.\n'
b'The 1960s and 1970s also saw considerable debate over the merits of structured programming, and whether programming languages should be designed to support it.[39] Edsger Dijkstra, in a famous 1968 letter published in the Communications of the ACM, argued that GOTO statements should be eliminated from all "higher level" programming languages.[40]\n'
b'The 1980s were years of relative consolidation. C++ combined object-oriented and systems programming. The United States government standardized Ada, a systems programming language derived from Pascal and intended for use by defense contractors. In Japan and elsewhere, vast sums were spent investigating so-called "fifth generation" languages that incorporated logic programming constructs.[41] The functional languages community moved to standardize ML and Lisp. Rather than inventing new paradigms, all of these movements elaborated upon the ideas invented in the previous decades.\n'
b'One important trend in language design for programming large-scale systems during the 1980s was an increased focus on the use of modules or large-scale organizational units of code. Modula-2, Ada, and ML all developed notable module systems in the 1980s, which were often wedded to generic programming constructs.[42]\n'
b'The rapid growth of the Internet in the mid-1990s created opportunities for new languages. Perl, originally a Unix scripting tool first released in 1987, became common in dynamic websites. Java came to be used for server-side programming, and bytecode virtual machines became popular again in commercial settings with their promise of "Write once, run anywhere" (UCSD Pascal had been popular for a time in the early 1980s). These developments were not fundamentally novel, rather they were refinements of many existing languages and paradigms (although their syntax was often based on the C family of programming languages).\n'
b"Programming language evolution continues, in both industry and research. Current directions include security and reliability verification, new kinds of modularity (mixins, delegates, aspects), and database integration such as Microsoft's LINQ.\n"
b'Fourth-generation programming languages (4GL) are computer programming languages which aim to provide a higher level of abstraction of the internal computer hardware details than 3GLs. Fifth generation programming languages (5GL) are programming languages based on solving problems using constraints given to the program, rather than using an algorithm written by a programmer.\n'
b'All programming languages have some primitive building blocks for the description of data and the processes or transformations applied to them (like the addition of two numbers or the selection of an item from a collection). These primitives are defined by syntactic and semantic rules which describe their structure and meaning respectively.\n'
b"A programming language's surface form is known as its syntax. Most programming languages are purely textual; they use sequences of text including words, numbers, and punctuation, much like written natural languages. On the other hand, there are some programming languages which are more graphical in nature, using visual relationships between symbols to specify a program.\n"
b'The syntax of a language describes the possible combinations of symbols that form a syntactically correct program. The meaning given to a combination of symbols is handled by semantics (either formal or hard-coded in a reference implementation). Since most languages are textual, this article discusses textual syntax.\n'
b'Programming language syntax is usually defined using a combination of regular expressions (for lexical structure) and Backus\xe2\x80\x93Naur form (for grammatical structure). Below is a simple grammar, based on Lisp:\n'
b'This grammar specifies the following:\n'
b'The following are examples of well-formed token sequences in this grammar: 12345, () and (a b c232 (1)).\n'
b"Not all syntactically correct programs are semantically correct. Many syntactically correct programs are nonetheless ill-formed, per the language's rules; and may (depending on the language specification and the soundness of the implementation) result in an error on translation or execution. In some cases, such programs may exhibit undefined behavior. Even when a program is well-defined within a language, it may still have a meaning that is not intended by the person who wrote it.\n"
b'Using natural language as an example, it may not be possible to assign a meaning to a grammatically correct sentence or the sentence may be false:\n'
b'The following C language fragment is syntactically correct, but performs operations that are not semantically defined (the operation *p >> 4 has no meaning for a value having a complex type and p->im is not defined because the value of p is the null pointer):\n'
b'If the type declaration on the first line were omitted, the program would trigger an error on undefined variable "p" during compilation. However, the program would still be syntactically correct since type declarations provide only semantic information.\n'
b"The grammar needed to specify a programming language can be classified by its position in the Chomsky hierarchy. The syntax of most programming languages can be specified using a Type-2 grammar, i.e., they are context-free grammars.[43] Some languages, including Perl and Lisp, contain constructs that allow execution during the parsing phase. Languages that have constructs that allow the programmer to alter the behavior of the parser make syntax analysis an undecidable problem, and generally blur the distinction between parsing and execution.[44] In contrast to Lisp's macro system and Perl's BEGIN blocks, which may contain general computations, C macros are merely string replacements and do not require code execution.[45]\n"
b'The term semantics refers to the meaning of languages, as opposed to their form (syntax).\n'
b'The static semantics defines restrictions on the structure of valid texts that are hard or impossible to express in standard syntactic formalisms.[3] For compiled languages, static semantics essentially include those semantic rules that can be checked at compile time. Examples include checking that every identifier is declared before it is used (in languages that require such declarations) or that the labels on the arms of a case statement are distinct.[46] Many important restrictions of this type, like checking that identifiers are used in the appropriate context (e.g. not adding an integer to a function name), or that subroutine calls have the appropriate number and type of arguments, can be enforced by defining them as rules in a logic called a type system. Other forms of static analyses like data flow analysis may also be part of static semantics. Newer programming languages like Java and C# have definite assignment analysis, a form of data flow analysis, as part of their static semantics.\n'
b'Once data has been specified, the machine must be instructed to perform operations on the data. For example, the semantics may define the strategy by which expressions are evaluated to values, or the manner in which control structures conditionally execute statements. The dynamic semantics (also known as execution semantics) of a language defines how and when the various constructs of a language should produce a program behavior. There are many ways of defining execution semantics. Natural language is often used to specify the execution semantics of languages commonly used in practice. A significant amount of academic research went into formal semantics of programming languages, which allow execution semantics to be specified in a formal manner. Results from this field of research have seen limited application to programming language design and implementation outside academia.\n'
b'A type system defines how a programming language classifies values and expressions into types, how it can manipulate those types and how they interact. The goal of a type system is to verify and usually enforce a certain level of correctness in programs written in that language by detecting certain incorrect operations. Any decidable type system involves a trade-off: while it rejects many incorrect programs, it can also prohibit some correct, albeit unusual programs. In order to bypass this downside, a number of languages have type loopholes, usually unchecked casts that may be used by the programmer to explicitly allow a normally disallowed operation between different types. In most typed languages, the type system is used only to type check programs, but a number of languages, usually functional ones, infer types, relieving the programmer from the need to write type annotations. The formal design and study of type systems is known as type theory.\n'
b'A language is typed if the specification of every operation defines types of data to which the operation is applicable, with the implication that it is not applicable to other types.[47] For example, the data represented by "this text between the quotes" is a string, and in many programming languages dividing a number by a string has no meaning and will be rejected by the compilers. The invalid operation may be detected when the program is compiled ("static" type checking) and will be rejected by the compiler with a compilation error message, or it may be detected when the program is run ("dynamic" type checking), resulting in a run-time exception. Many languages allow a function called an exception handler to be written to handle this exception and, for example, always return "-1" as the result.\n'
b'A special case of typed languages are the single-type languages. These are often scripting or markup languages, such as REXX or SGML, and have only one data type[dubious  \xe2\x80\x93 discuss]-\xe2\x80\x94most commonly character strings which are used for both symbolic and numeric data.\n'
b'In contrast, an untyped language, such as most assembly languages, allows any operation to be performed on any data, which are generally considered to be sequences of bits of various lengths.[47] High-level languages which are untyped include BCPL, Tcl, and some varieties of Forth.\n'
b"In practice, while few languages are considered typed from the point of view of type theory (verifying or rejecting all operations), most modern languages offer a degree of typing.[47] Many production languages provide means to bypass or subvert the type system, trading type-safety for finer control over the program's execution (see casting).\n"
b'In static typing, all expressions have their types determined prior to when the program is executed, typically at compile-time. For example, 1 and (2+2) are integer expressions; they cannot be passed to a function that expects a string, or stored in a variable that is defined to hold dates.[47]\n'
b'Statically typed languages can be either manifestly typed or type-inferred. In the first case, the programmer must explicitly write types at certain textual positions (for example, at variable declarations). In the second case, the compiler infers the types of expressions and declarations based on context. Most mainstream statically typed languages, such as C++, C# and Java, are manifestly typed. Complete type inference has traditionally been associated with less mainstream languages, such as Haskell and ML. However, many manifestly typed languages support partial type inference; for example, C++, Java and C# all infer types in certain limited cases.[48] Additionally, some programming languages allow for some types to be automatically converted to other types; for example, an int can be used where the program expects a float.\n'
b'Dynamic typing, also called latent typing, determines the type-safety of operations at run time; in other words, types are associated with run-time values rather than textual expressions.[47] As with type-inferred languages, dynamically typed languages do not require the programmer to write explicit type annotations on expressions. Among other things, this may permit a single variable to refer to values of different types at different points in the program execution. However, type errors cannot be automatically detected until a piece of code is actually executed, potentially making debugging more difficult. Lisp, Smalltalk, Perl, Python, JavaScript, and Ruby are all examples of dynamically typed languages.\n'
b'Weak typing allows a value of one type to be treated as another, for example treating a string as a number.[47] This can occasionally be useful, but it can also allow some kinds of program faults to go undetected at compile time and even at run time.\n'
b'Strong typing prevents the above. An attempt to perform an operation on the wrong type of value raises an error.[47] Strongly typed languages are often termed type-safe or safe.\n'
b'An alternative definition for "weakly typed" refers to languages, such as Perl and JavaScript, which permit a large number of implicit type conversions. In JavaScript, for example, the expression 2 * x implicitly converts x to a number, and this conversion succeeds even if x is null, undefined, an Array, or a string of letters. Such implicit conversions are often useful, but they can mask programming errors.\nStrong and static are now generally considered orthogonal concepts, but usage in the literature differs. Some use the term strongly typed to mean strongly, statically typed, or, even more confusingly, to mean simply statically typed. Thus C has been called both strongly typed and weakly, statically typed.[49][50]\n'
b'It may seem odd to some professional programmers that C could be "weakly, statically typed". However, notice that the use of the generic pointer, the void* pointer, does allow for casting of pointers to other pointers without needing to do an explicit cast. This is extremely similar to somehow casting an array of bytes to any kind of datatype in C without using an explicit cast, such as (int) or (char).\n'
b"Most programming languages have an associated core library (sometimes known as the 'standard library', especially if it is included as part of the published language standard), which is conventionally made available by all implementations of the language. Core libraries typically include definitions for commonly used algorithms, data structures, and mechanisms for input and output.\n"
b'The line between a language and its core library differs from language to language. In some cases, the language designers may treat the library as a separate entity from the language. However, a language\'s core library is often treated as part of the language by its users, and some language specifications even require that this library be made available in all implementations. Indeed, some languages are designed so that the meanings of certain syntactic constructs cannot even be described without referring to the core library. For example, in Java, a string literal is defined as an instance of the java.lang.String class; similarly, in Smalltalk, an anonymous function expression (a "block") constructs an instance of the library\'s BlockContext class. Conversely, Scheme contains multiple coherent subsets that suffice to construct the rest of the language as library macros, and so the language designers do not even bother to say which portions of the language must be implemented as language constructs, and which must be implemented as parts of a library.\n'
b'Programming languages share properties with natural languages related to their purpose as vehicles for communication, having a syntactic form separate from its semantics, and showing language families of related languages branching one from another.[51][52] But as artificial constructs, they also differ in fundamental ways from languages that have evolved through usage. A significant difference is that a programming language can be fully described and studied in its entirety, since it has a precise and finite definition.[53] By contrast, natural languages have changing meanings given by their users in different communities. While constructed languages are also artificial languages designed from the ground up with a specific purpose, they lack the precise and complete semantic definition that a programming language has.\n'
b'Many programming languages have been designed from scratch, altered to meet new needs, and combined with other languages. Many have eventually fallen into disuse. Although there have been attempts to design one "universal" programming language that serves all purposes, all of them have failed to be generally accepted as filling this role.[54] The need for diverse programming languages arises from the diversity of contexts in which languages are used:\n'
b'One common trend in the development of programming languages has been to add more ability to solve problems using a higher level of abstraction. The earliest programming languages were tied very closely to the underlying hardware of the computer. As new programming languages have developed, features have been added that let programmers express ideas that are more remote from simple translation into underlying hardware instructions. Because programmers are less tied to the complexity of the computer, their programs can do more computing with less effort from the programmer. This lets them write more functionality per time unit.[55]\n'
b'\nNatural language programming has been proposed as a way to eliminate the need for a specialized language for programming. However, this goal remains distant and its benefits are open to debate. Edsger W. Dijkstra took the position that the use of a formal language is essential to prevent the introduction of meaningless constructs, and dismissed natural language programming as "foolish".[56] Alan Perlis was similarly dismissive of the idea.[57] Hybrid approaches have been taken in Structured English and SQL.\n'
b"A language's designers and users must construct a number of artifacts that govern and enable the practice of programming. The most important of these artifacts are the language specification and implementation.\n"
b'The specification of a programming language is an artifact that the language users and the implementors can use to agree upon whether a piece of source code is a valid program in that language, and if so what its behavior shall be.\n'
b'A programming language specification can take several forms, including the following:\n'
b'An implementation of a programming language provides a way to write programs in that language and execute them on one or more configurations of hardware and software. There are, broadly, two approaches to programming language implementation: compilation and interpretation. It is generally possible to implement a language using either technique.\n'
b'The output of a compiler may be executed by hardware or a program called an interpreter. In some implementations that make use of the interpreter approach there is no distinct boundary between compiling and interpreting. For instance, some implementations of BASIC compile and then execute the source a line at a time.\n'
b'Programs that are executed directly on the hardware usually run several orders of magnitude faster than those that are interpreted in software.[citation needed]\n'
b'One technique for improving the performance of interpreted programs is just-in-time compilation. Here the virtual machine, just before execution, translates the blocks of bytecode which are going to be used to machine code, for direct execution on the hardware.\n'
b'Although most of the most commonly used programming languages have fully open specifications and implementations, many programming languages exist only as proprietary programming languages with the implementation available only from a single vendor, which may claim that such a proprietary language is their intellectual property. Proprietary programming languages are commonly domain specific languages or internal scripting languages for a single product; some proprietary languages are used only internally within a vendor, while others are available to external users.\n'
b"Some programming languages exist on the border between proprietary and open; for example, Oracle Corporation asserts proprietary rights to some aspects of the Java programming language,[61] and Microsoft's C# programming language, which has open implementations of most parts of the system, also has Common Language Runtime (CLR) as a closed environment.[62]\n"
b"Many proprietary languages are widely used, in spite of their proprietary nature; examples include MATLAB, VBScript, and Wolfram Language.  Some languages may make the transition from closed to open; for example, Erlang was originally an Ericsson's internal programming language.[63]\n"
b'Thousands of different programming languages have been created, mainly in the computing field.[64]\nSoftware is commonly built with 5 programming languages or more.[65]\n'
b'Programming languages differ from most other forms of human expression in that they require a greater degree of precision and completeness. When using a natural language to communicate with other people, human authors and speakers can be ambiguous and make small errors, and still expect their intent to be understood. However, figuratively speaking, computers "do exactly what they are told to do", and cannot "understand" what code the programmer intended to write. The combination of the language definition, a program, and the program\'s inputs must fully specify the external behavior that occurs when the program is executed, within the domain of control of that program. On the other hand, ideas about an algorithm can be communicated to humans without the precision required for execution by using pseudocode, which interleaves natural language with code written in a programming language.\n'
b'A programming language provides a structured mechanism for defining pieces of data, and the operations or transformations that may be carried out automatically on that data. A programmer uses the abstractions present in the language to represent the concepts involved in a computation. These concepts are represented as a collection of the simplest elements available (called primitives).[66] Programming is the process by which programmers combine these primitives to compose new programs, or adapt existing ones to new uses or a changing environment.\n'
b'Programs for a computer might be executed in a batch process without human interaction, or a user might type commands in an interactive session of an interpreter. In this case the "commands" are simply programs, whose execution is chained together. When a language can run its commands through an interpreter (such as a Unix shell or other command-line interface), without compiling, it is called a scripting language.[67]\n'
b'Determining which is the most widely used programming language is difficult since the definition of usage varies by context. One language may occupy the greater number of programmer hours, a different one has more lines of code, and a third may consume the most CPU time. Some languages are very popular for particular kinds of applications. For example, COBOL is still strong in the corporate data center, often on large mainframes;[68][69] Fortran in scientific and engineering applications; Ada in aerospace, transportation, military, real-time and embedded applications; and C in embedded applications and operating systems. Other languages are regularly used to write many different kinds of applications.\n'
b'Various methods of measuring language popularity, each subject to a different bias over what is measured, have been proposed:\n'
b'Combining and averaging information from various internet sites, stackify.com reported the ten most popular programming languages as (in descending order by overall popularity): Java, C, C++, Python, C#, JavaScript, VB .NET, R, PHP, and MATLAB.[73]\n'
b'A dialect of a programming language or a data exchange language is a (relatively small) variation or extension of the language that does not change its intrinsic nature. With languages such as Scheme and Forth, standards may be considered insufficient, inadequate or illegitimate by implementors, so often they will deviate from the standard, making a new dialect. In other cases, a dialect is created for use in a domain-specific language, often a subset. In the Lisp world, most languages that use basic S-expression syntax and Lisp-like semantics are considered Lisp dialects, although they vary wildly, as do, say, Racket and Clojure. As it is common for one language to have several dialects, it can become quite difficult for an inexperienced programmer to find the right documentation. The BASIC programming language has many dialects.\n'
b'The explosion of Forth dialects led to the saying "If you\'ve seen one Forth... you\'ve seen one Forth."\n'
b'There is no overarching classification scheme for programming languages. A given programming language does not usually have a single ancestor language. Languages commonly arise by combining the elements of several predecessor languages with new ideas in circulation at the time. Ideas that originate in one language will diffuse throughout a family of related languages, and then leap suddenly across familial gaps to appear in an entirely different family.\n'
b'The task is further complicated by the fact that languages can be classified along multiple axes. For example, Java is both an object-oriented language (because it encourages object-oriented organization) and a concurrent language (because it contains built-in constructs for running multiple threads in parallel). Python is an object-oriented scripting language.\n'
b'In broad strokes, programming languages divide into programming paradigms and a classification by intended domain of use, with general-purpose programming languages distinguished from domain-specific programming languages. Traditionally, programming languages have been regarded as describing computation in terms of imperative sentences, i.e. issuing commands. These are generally called imperative programming languages. A great deal of research in programming languages has been aimed at blurring the distinction between a program as a set of instructions and a program as an assertion about the desired answer, which is the main feature of declarative programming.[74] More refined paradigms include procedural programming, object-oriented programming, functional programming, and logic programming; some languages are hybrids of paradigms or multi-paradigmatic. An assembly language is not so much a paradigm as a direct model of an underlying machine architecture. By purpose, programming languages might be considered general purpose, system programming languages, scripting languages, domain-specific languages, or concurrent/distributed languages (or a combination of these).[75] Some general purpose languages were designed largely with educational goals.[76]\n'
b'A programming language may also be classified by factors unrelated to programming paradigm. For instance, most programming languages use English language keywords, while a minority do not. Other languages may be classified as being deliberately esoteric or not.\n'

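Defining a rule for our goal
If you look at the data we collected, you will see that it is made up of sentences. After analyzing the text, you need to define a rule that extracts only programming language names, such as Java, JavaScript, MATLAB, and so on. Think about what simple rule or logic could help you reach that goal. Take your time! Focus on your thought process and try to spot patterns.
To define a rule, I generalize the problem in the context of the given data. During my analysis, I noticed that most programming language names appear together with the word "language": when "language" occurs in a sentence, there is a high chance that an actual programming language name appears in that sentence too. For example, the C programming language is specified by an ISO standard. In that example, the C programming language is named and the word "language" appears in the same sentence.
So I will use the following procedure. First, extract the sentences that contain the word "language". Second, process the extracted sentences and check whether they contain any capitalized or camel-case words. If I find any, I extract them and put them in a list, because most programming language names are written in capitalized or camel-case form, for example C, C++, Java, and JavaScript. In some cases, a single sentence contains the names of more than one programming language.
The procedure above is our rule. In logical form it looks like this: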

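Extract the sentences that contain the word "language"
Then look for camel-case or capitalized words in each sentence
Collect those words into a list
Print the list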
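The four steps above can be tried on a few sentences held in memory before touching any file. The following is only a minimal sketch under stated assumptions: the function name `extract_candidates` and the sample sentences are hypothetical and written for this illustration, and "capitalized" is approximated as "the first character is uppercase".

```python
def extract_candidates(sentences):
    """Apply the rule to a list of sentence strings and return candidate words."""
    candidates = []
    for sentence in sentences:
        # Step 1: keep only sentences that contain the word "language".
        if "language" not in sentence:
            continue
        # Step 2: look for words that start with an uppercase letter.
        for word in sentence.split():
            if word[0].isupper():
                # Step 3: collect the candidate in the list.
                candidates.append(word)
    return candidates


# Hypothetical sample sentences, written only for this illustration.
sample = [
    "The C programming language is specified by an ISO standard.",
    "Compilers translate source code into machine code.",
    "Java and JavaScript are different languages.",
]

# Step 4: print the list.
print(extract_candidates(sample))
# → ['The', 'C', 'ISO', 'Java', 'JavaScript']
```

Note that "The" and "ISO" slip through: the rule treats every capitalized word as a candidate, so sentence-initial words and acronyms are picked up alongside real language names.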
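Coding the rule and generating a prototype and results
This example shows what the rule-making process looks like in practice. It is our first step, so we are not paying much attention to accuracy. I know this is neither the only way to solve the problem nor the most efficient one. There are other effective approaches, but I use this one because I find it the simplest and easiest to understand.
The example should help you understand how to code a rule, and which follow-up steps you can take to improve the output once you have the results of the first prototype.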

def rulelogic(filepath):
    """Collect every capitalized word from lines that mention 'language'."""
    programminglanguagelist = []
    with open(filepath, encoding='UTF-8') as file:
        for line in file:
            # 'language' also matches 'languages', so one check is enough
            if 'language' in line:
                for word in line.split():
                    # keep words whose first character is uppercase
                    if word[0].isupper():
                        programminglanguagelist.append(word)
    print(programminglanguagelist)


rulelogic("simpleruledata.txt")
['A', 'Programming', 'Programming', 'The', 'Musa', 'Baghdad,', 'Islamic', 'Golden', 'Age".[1]', 'From', 'Jacquard', 'Thousands', 'Many', 'The', 'Some', 'C', 'ISO', 'Standard)', 'Perl)', 'Some', 'A', 'Some,', 'Traits', 'Markup', 'XML,', 'HTML,', 'Programming', 'XSLT,', 'Turing', 'XML', 'Moreover,', 'LaTeX,', 'Turing', 'The', 'However,', 'One', 'In', 'For', 'Another', 'John', 'C.', 'Reynolds', 'He', 'Turing-complete,', 'The', 'The', 'Absolute', 'The', 'These', 'The', 'An', 'Plankalkül,', 'German', 'Z3', 'Konrad', 'Zuse', 'However,', 'John', "Mauchly's", 'Short', 'Code,', 'Unlike', 'Short', 'Code', 'However,', 'At', 'University', 'Manchester,', 'Alick', 'Glennie', 'Autocode', 'A', 'The', 'Mark', 'University', 'Manchester', 'The', 'Mark', 'R.', 'A.', 'Brooker', 'Autocode".', 'Brooker', 'Ferranti', 'Mercury', 'University', 'Manchester.', 'The', 'EDSAC', 'D.', 'F.', 'Hartley', 'University', 'Cambridge', 'Mathematical', 'Laboratory', 'Known', 'EDSAC', 'Autocode,', 'Mercury', 'Autocode', 'A', 'Atlas', 'Autocode', 'University', 'Manchester', 'Atlas', 'In', 'FORTRAN', 'IBM', 'John', 'Backus.', 'It', 'It', 'Another', 'Grace', 'Hopper', 'US,', 'FLOW-MATIC.', 'It', 'UNIVAC', 'I', 'Remington', 'Rand', 'Hopper', 'English', 'The', 'FLOW-MATIC', 'Flow-Matic', 'COBOL,', 'AIMACO', 'The', 'These', 'The', 'Each', 'The', 'Edsger', 'Dijkstra,', 'Communications', 'ACM,', 'GOTO', 'The', 'C++', 'The', 'United', 'States', 'Ada,', 'Pascal', 'In', 'Japan', 'The', 'ML', 'Lisp.', 'Rather', 'One', 'Modula-2,', 'Ada,', 'ML', 'The', 'Internet', 'Perl,', 'Unix', 'Java', 'Pascal', 'These', 'C', 'Programming', 'Current', "Microsoft's", 'LINQ.', 'Fourth-generation', 'Fifth', 'All', 'These', 'A', 'Most', 'On', 'The', 'The', 'Since', 'Programming', 'Backus–Naur', 'Below', 'Lisp:', 'Not', 'Many', 'In', 'Even', 'Using', 'The', 'C', 'The', 'Chomsky', 'The', 'Type-2', 'Some', 'Perl', 'Lisp,', 'Languages', 'In', "Lisp's", "Perl's", 'BEGIN', 'C', 'The', 'The', 'For', 'Examples', 'Many', 'Other', 'Newer', 
'Java', 'C#', 'Once', 'For', 'The', 'There', 'Natural', 'A', 'Results', 'A', 'The', 'Any', 'In', 'In', 'The', 'A', 'For', 'The', 'Many', 'A', 'These', 'REXX', 'SGML,', 'In', 'High-level', 'BCPL,', 'Tcl,', 'Forth.', 'In', 'Many', 'Statically', 'In', 'In', 'Most', 'C++,', 'C#', 'Java,', 'Complete', 'Haskell', 'ML.', 'However,', 'Java', 'C#', 'Additionally,', 'Dynamic', 'As', 'Among', 'However,', 'Lisp,', 'Smalltalk,', 'Perl,', 'Python,', 'JavaScript,', 'Ruby', 'Strong', 'An', 'Strongly', 'An', 'Perl', 'JavaScript,', 'In', 'JavaScript,', 'Array,', 'Such', 'Strong', 'Some', 'Thus', 'C', 'Most', 'Core', 'The', 'In', 'However,', 'Indeed,', 'For', 'Java,', 'Smalltalk,', 'BlockContext', 'Conversely,', 'Scheme', 'Programming', 'But', 'A', 'By', 'While', 'Many', 'Many', 'Although', 'The', 'One', 'The', 'As', 'Because', 'This', 'Natural', 'However,', 'Edsger', 'W.', 'Dijkstra', 'Alan', 'Perlis', 'Hybrid', 'Structured', 'English', 'SQL.', 'A', 'The', 'The', 'A', 'An', 'There', 'It', 'Although', 'Proprietary', 'Some', 'Oracle', 'Corporation', 'Java', "Microsoft's", 'C#', 'Common', 'Language', 'Runtime', 'Many', 'MATLAB', 'VBScript.', 'Some', 'Erlang', "Ericsson's", 'Thousands', 'Software', 'Programming', 'When', 'However,', 'The', 'On', 'A', 'A', 'These', 'Programming', 'Programs', 'In', 'When', 'Unix', 'It', 'One', 'CPU', 'Some', 'For', 'COBOL', 'Fortran', 'Ada', 'C', 'Other', 'Various', 'Combining', 'C,', 'Java,', 'PHP,', 'JavaScript,', 'C++,', 'Python,', 'Shell,', 'Ruby,', 'Objective-C', 'C#.[70]', 'There', 'A', 'Languages', 'Ideas', 'The', 'For', 'Java', 'Python', 'In', 'Traditionally,', 'These', 'A', 'More', 'An', 'By', 'Some', 'A', 'For', 'English', 'Other', 'A', 'Programming', 'Most', 'Early', 'Musa', 'Baghdad,', 'Islamic', 'Golden', 'Age.[1]', 'From', 'Jacquard', 'However,', 'Thousands', 'Many', 'The', 'Some', 'C', 'ISO', 'Standard)', 'Perl)', 'Some', 'A', 'Some,', 'Traits', 'Markup', 'XML,', 'HTML,', 'Programming', 'XSLT,', 'Turing', 'XML', 'Moreover,', 'LaTeX,', 
'Turing', 'The', 'However,', 'One', 'In', 'For', 'Another', 'John', 'C.', 'Reynolds', 'He', 'Turing-complete,', 'Slightly', 'For', 'The', 'Machine', 'The', 'These', 'The', 'An', 'Plankalkül,', 'German', 'Z3', 'Konrad', 'Zuse', 'However,', 'John', "Mauchly's", 'Short', 'Code,', 'Unlike', 'Short', 'Code', 'However,', 'At', 'University', 'Manchester,', 'Alick', 'Glennie', 'Autocode', 'A', 'The', 'Mark', 'University', 'Manchester', 'The', 'Mark', 'R.', 'A.', 'Brooker', 'Autocode".', 'Brooker', 'Ferranti', 'Mercury', 'University', 'Manchester.', 'The', 'EDSAC', 'D.', 'F.', 'Hartley', 'University', 'Cambridge', 'Mathematical', 'Laboratory', 'Known', 'EDSAC', 'Autocode,', 'Mercury', 'Autocode', 'A', 'Atlas', 'Autocode', 'University', 'Manchester', 'Atlas', 'In', 'FORTRAN', 'IBM', 'John', 'Backus.', 'It', 'It', 'Another', 'Grace', 'Hopper', 'US,', 'FLOW-MATIC.', 'It', 'UNIVAC', 'I', 'Remington', 'Rand', 'Hopper', 'English', 'The', 'FLOW-MATIC', 'FLOW-MATIC', 'COBOL,', 'AIMACO', 'The', 'These', 'The', 'Each', 'The', 'Edsger', 'Dijkstra,', 'Communications', 'ACM,', 'GOTO', 'The', 'C++', 'The', 'United', 'States', 'Ada,', 'Pascal', 'In', 'Japan', 'The', 'ML', 'Lisp.', 'Rather', 'One', 'Modula-2,', 'Ada,', 'ML', 'The', 'Internet', 'Perl,', 'Unix', 'Java', 'Pascal', 'These', 'C', 'Programming', 'Current', "Microsoft's", 'LINQ.', 'Fourth-generation', 'Fifth', 'All', 'These', 'A', 'Most', 'On', 'The', 'The', 'Since', 'Programming', 'Backus–Naur', 'Below', 'Lisp:', 'Not', 'Many', 'In', 'Even', 'Using', 'The', 'C', 'The', 'Chomsky', 'The', 'Type-2', 'Some', 'Perl', 'Lisp,', 'Languages', 'In', "Lisp's", "Perl's", 'BEGIN', 'C', 'The', 'The', 'For', 'Examples', 'Many', 'Other', 'Newer', 'Java', 'C#', 'Once', 'For', 'The', 'There', 'Natural', 'A', 'Results', 'A', 'The', 'Any', 'In', 'In', 'The', 'A', 'For', 'The', 'Many', 'A', 'These', 'REXX', 'SGML,', 'In', 'High-level', 'BCPL,', 'Tcl,', 'Forth.', 'In', 'Many', 'Statically', 'In', 'In', 'Most', 'C++,', 'C#', 'Java,', 'Complete', 
'Haskell', 'ML.', 'However,', 'C++,', 'Java', 'C#', 'Additionally,', 'Dynamic', 'As', 'Among', 'However,', 'Lisp,', 'Smalltalk,', 'Perl,', 'Python,', 'JavaScript,', 'Ruby', 'Strong', 'An', 'Strongly', 'An', 'Perl', 'JavaScript,', 'In', 'JavaScript,', 'Array,', 'Such', 'Most', 'Core', 'The', 'In', 'However,', 'Indeed,', 'For', 'Java,', 'Smalltalk,', 'BlockContext', 'Conversely,', 'Scheme', 'Programming', 'But', 'A', 'By', 'While', 'Many', 'Many', 'Although', 'The', 'One', 'The', 'As', 'Because', 'This', 'Natural', 'However,', 'Edsger', 'W.', 'Dijkstra', 'Alan', 'Perlis', 'Hybrid', 'Structured', 'English', 'SQL.', 'A', 'The', 'The', 'A', 'An', 'There', 'It', 'Although', 'Proprietary', 'Some', 'Oracle', 'Corporation', 'Java', "Microsoft's", 'C#', 'Common', 'Language', 'Runtime', 'Many', 'MATLAB,', 'VBScript,', 'Wolfram', 'Language.', 'Some', 'Erlang', "Ericsson's", 'Thousands', 'Software', 'Programming', 'When', 'However,', 'The', 'On', 'A', 'A', 'These', 'Programming', 'Programs', 'In', 'When', 'Unix', 'Determining', 'One', 'CPU', 'Some', 'For', 'COBOL', 'Fortran', 'Ada', 'C', 'Other', 'Various', 'Combining', 'Java,', 'C,', 'C++,', 'Python,', 'C#,', 'JavaScript,', 'VB', 'R,', 'PHP,', 'MATLAB.[73]', 'A', 'With', 'Scheme', 'Forth,', 'In', 'In', 'Lisp', 'S-expression', 'Lisp-like', 'Lisp', 'Racket', 'Clojure.', 'As', 'The', 'BASIC', 'There', 'A', 'Languages', 'Ideas', 'The', 'For', 'Java', 'Python', 'In', 'Traditionally,', 'These', 'A', 'More', 'An', 'By', 'Some', 'A', 'For', 'English', 'Other', 'Programming', 'Early', 'Musa', 'Baghdad,', 'Islamic', 'Golden', 'Age.[1]', 'From', 'Jacquard', 'However,', 'Many', 'Some', 'C', 'ISO', 'Standard)', 'Perl)', 'Some', 'Some,', 'Traits', 'XML,', 'HTML,', 'Programming', 'XSLT,', 'Turing', 'XML', 'Moreover,', 'LaTeX,', 'Turing', 'However,', 'One', 'In', 'For', 'John', 'C.', 'Reynolds', 'He', 'Turing-complete,', 'Colossus,', 'For', 'The', 'Machine', 'These', 'An', 'Plankalk\\xc3\\xbcl,', 'German', 'Z3', 'Konrad', 'Zuse', 'However,', 
"Mauchly's", 'Short', 'Code,', 'Unlike', 'Short', 'Code', 'However,', 'University', 'Manchester,', 'Alick', 'Glennie', 'Autocode', 'A', 'The', 'Mark', 'University', 'Manchester', 'Mark', 'R.', 'A.', 'Brooker', 'Autocode".', 'Brooker', 'Ferranti', 'Mercury', 'University', 'Manchester.', 'The', 'EDSAC', 'D.', 'F.', 'Hartley', 'University', 'Cambridge', 'Mathematical', 'Laboratory', 'Known', 'EDSAC', 'Autocode,', 'Mercury', 'Autocode', 'A', 'Atlas', 'Autocode', 'University', 'Manchester', 'Atlas', 'FORTRAN', 'IBM', 'John', 'Backus.', 'It', 'It', 'Grace', 'Hopper', 'US,', 'FLOW-MATIC.', 'It', 'UNIVAC', 'I', 'Remington', 'Rand', 'Hopper', 'English', 'The', 'FLOW-MATIC', 'FLOW-MATIC', 'COBOL,', 'AIMACO', 'These', 'Edsger', 'Dijkstra,', 'Communications', 'ACM,', 'GOTO', 'C++', 'The', 'United', 'States', 'Ada,', 'Pascal', 'In', 'Japan', 'The', 'ML', 'Lisp.', 'Rather', 'Modula-2,', 'Ada,', 'ML', 'Internet', 'Perl,', 'Unix', 'Java', 'Pascal', 'These', 'C', 'Current', "Microsoft's", 'LINQ.\\n"b\'Fourth-generation', 'Fifth', 'These', 'Most', 'On', 'The', 'Since', 'Backus\\xe2\\x80\\x93Naur', 'Below', "Lisp:\\n'b'This", 'Many', 'In', 'Even', 'C', 'However,', 'Chomsky', 'The', 'Type-2', 'Some', 'Perl', 'Lisp,', 'Languages', 'In', "Lisp's", "Perl's", 'BEGIN', 'C', 'For', 'Examples', 'Many', 'Other', 'Newer', 'Java', 'C#', 'For', 'The', 'There', 'Natural', 'A', 'Results', 'The', 'Any', 'In', 'In', 'The', 'For', 'The', 'Many', 'These', 'REXX', 'SGML,', 'High-level', 'BCPL,', 'Tcl,', 'Forth.\\n\'b"In', 'Many', 'For', 'In', 'In', 'Most', 'C++,', 'C#', 'Java,', 'Complete', 'Haskell', 'ML.', 'However,', 'C++,', 'Java', 'C#', 'Additionally,', 'As', 'Among', 'However,', 'Lisp,', 'Smalltalk,', 'Perl,', 'Python,', 'JavaScript,', 'Ruby', 'This', 'An', 'Strongly', 'Perl', 'JavaScript,', 'In', 'JavaScript,', 'Array,', 'Such', 'Some', 'Thus', 'C', 'C', 'However,', 'This', 'C', 'Core', 'In', 'However,', 'Indeed,', 'For', 'Java,', 'Smalltalk,', 'BlockContext', 'Conversely,', 'Scheme', 'But', 
'A', 'By', 'While', 'Many', 'Although', 'The', 'The', 'As', 'Because', 'This', 'However,', 'Edsger', 'W.', 'Dijkstra', 'Alan', 'Perlis', 'Hybrid', 'Structured', 'English', 'SQL.\\n\'b"A', 'The', 'There', 'It', 'In', 'For', 'BASIC', 'Here', 'Proprietary', 'Oracle', 'Corporation', 'Java', "Microsoft's", 'C#', 'Common', 'Language', 'Runtime', 'MATLAB,', 'VBScript,', 'Wolfram', 'Language.', 'Some', 'Erlang', "Ericsson's", 'When', 'However,', 'The', 'On', 'A', 'These', 'Programming', 'In', 'When', 'Unix', 'One', 'CPU', 'Some', 'For', 'COBOL', 'Fortran', 'Ada', 'C', 'Other', 'Java,', 'C,', 'C++,', 'Python,', 'C#,', 'JavaScript,', 'VB', 'R,', 'PHP,', "MATLAB.[73]\\n'b'A", 'With', 'Scheme', 'Forth,', 'In', 'In', 'Lisp', 'S-expression', 'Lisp-like', 'Lisp', 'Racket', 'Clojure.', 'As', 'The', 'BASIC', 'Forth', 'Forth...', 'Forth."\\n\'b\'There', 'A', 'Languages', 'Ideas', 'For', 'Java', 'Python', 'Traditionally,', 'These', 'A', 'More', 'An', 'By', 'Some', 'For', 'English', 'Other', 'Programming', 'Early', 'Musa', 'Baghdad,', 'Islamic', 'Golden', 'Age.[1]', 'From', 'Jacquard', 'However,', 'Many', 'Some', 'C', 'ISO', 'Standard)', 'Perl)', 'Some', 'Some,', 'Traits', 'XML,', 'HTML,', 'Programming', 'XSLT,', 'Turing', 'XML', 'Moreover,', 'LaTeX,', 'Turing', 'However,', 'One', 'In', 'For', 'John', 'C.', 'Reynolds', 'He', 'Turing-complete,', 'Colossus,', 'For', 'The', 'Machine', 'These', 'An', 'Plankalk\\xc3\\xbcl,', 'German', 'Z3', 'Konrad', 'Zuse', 'However,', "Mauchly's", 'Short', 'Code,', 'Unlike', 'Short', 'Code', 'However,', 'University', 'Manchester,', 'Alick', 'Glennie', 'Autocode', 'A', 'The', 'Mark', 'University', 'Manchester', 'Mark', 'R.', 'A.', 'Brooker', 'Autocode".', 'Brooker', 'Ferranti', 'Mercury', 'University', 'Manchester.', 'The', 'EDSAC', 'D.', 'F.', 'Hartley', 'University', 'Cambridge', 'Mathematical', 'Laboratory', 'Known', 'EDSAC', 'Autocode,', 'Mercury', 'Autocode', 'A', 'Atlas', 'Autocode', 'University', 'Manchester', 'Atlas', 'FORTRAN', 'IBM', 'John', 
'Backus.', 'It', 'It', 'Grace', 'Hopper', 'US,', 'FLOW-MATIC.', 'It', 'UNIVAC', 'I', 'Remington', 'Rand', 'Hopper', 'English', 'The', 'FLOW-MATIC', 'FLOW-MATIC', 'COBOL,', 'AIMACO', 'These', 'Edsger', 'Dijkstra,', 'Communications', 'ACM,', 'GOTO', 'C++', 'The', 'United', 'States', 'Ada,', 'Pascal', 'In', 'Japan', 'The', 'ML', 'Lisp.', 'Rather', 'Modula-2,', 'Ada,', 'ML', 'Internet', 'Perl,', 'Unix', 'Java', 'Pascal', 'These', 'C', 'Current', "Microsoft's", 'LINQ.\\n"b\'Fourth-generation', 'Fifth', 'These', 'Most', 'On', 'The', 'Since', 'Backus\\xe2\\x80\\x93Naur', 'Below', "Lisp:\\n'b'This", 'Many', 'In', 'Even', 'C', 'However,', 'Chomsky', 'The', 'Type-2', 'Some', 'Perl', 'Lisp,', 'Languages', 'In', "Lisp's", "Perl's", 'BEGIN', 'C', 'For', 'Examples', 'Many', 'Other', 'Newer', 'Java', 'C#', 'For', 'The', 'There', 'Natural', 'A', 'Results', 'The', 'Any', 'In', 'In', 'The', 'For', 'The', 'Many', 'These', 'REXX', 'SGML,', 'High-level', 'BCPL,', 'Tcl,', 'Forth.\\n\'b"In', 'Many', 'For', 'In', 'In', 'Most', 'C++,', 'C#', 'Java,', 'Complete', 'Haskell', 'ML.', 'However,', 'C++,', 'Java', 'C#', 'Additionally,', 'As', 'Among', 'However,', 'Lisp,', 'Smalltalk,', 'Perl,', 'Python,', 'JavaScript,', 'Ruby', 'This', 'An', 'Strongly', 'Perl', 'JavaScript,', 'In', 'JavaScript,', 'Array,', 'Such', 'Some', 'Thus', 'C', 'C', 'However,', 'This', 'C', 'Core', 'In', 'However,', 'Indeed,', 'For', 'Java,', 'Smalltalk,', 'BlockContext', 'Conversely,', 'Scheme', 'But', 'A', 'By', 'While', 'Many', 'Although', 'The', 'The', 'As', 'Because', 'This', 'However,', 'Edsger', 'W.', 'Dijkstra', 'Alan', 'Perlis', 'Hybrid', 'Structured', 'English', 'SQL.\\n\'b"A', 'The', 'There', 'It', 'In', 'For', 'BASIC', 'Here', 'Proprietary', 'Oracle', 'Corporation', 'Java', "Microsoft's", 'C#', 'Common', 'Language', 'Runtime', 'MATLAB,', 'VBScript,', 'Wolfram', 'Language.', 'Some', 'Erlang', "Ericsson's", 'When', 'However,', 'The', 'On', 'A', 'These', 'Programming', 'In', 'When', 'Unix', 'One', 'CPU', 'Some', 
...]

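As you can see, our basic rule extracts programming languages, but it also extracts junk data. Now think about how to restrict the rule, or which constraints to add, so that it gives accurate output. That will be your task.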
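One possible way to tighten the rule is sketched below. It assumes the first-pass rule returns capitalized tokens like those in the output above, and it filters them against `KNOWN_LANGUAGES`, a small hand-written whitelist invented for this illustration (neither the whitelist nor `filter_languages` comes from the book's code).

```python
import string

# Hypothetical whitelist written by hand; a real one would be much longer.
KNOWN_LANGUAGES = {"c", "c++", "c#", "java", "python", "perl", "lisp",
                   "fortran", "cobol", "haskell", "ruby", "javascript"}

# Punctuation to strip from the end of a token, keeping '+' and '#'
# so that names such as C++ and C# survive.
TRAILING = string.punctuation.replace("+", "").replace("#", "")

def filter_languages(tokens):
    """Keep only tokens that, once cleaned, appear in the whitelist."""
    found = []
    for token in tokens:
        cleaned = token.rstrip(TRAILING)      # 'Perl,' -> 'Perl'
        if cleaned.lower() in KNOWN_LANGUAGES and cleaned not in found:
            found.append(cleaned)
    return found

candidates = ['It', 'Grace', 'Hopper', 'COBOL,', 'C++', 'The',
              'Perl,', 'Java', 'Lisp.', 'Java']
print(filter_languages(candidates))
```

A whitelist trades coverage for precision, which is the usual bargain in a rule-based system: languages missing from the list are silently dropped, but nothing outside the list gets through.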
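Python pattern-matching rules for a proofreading application
Now suppose you want to build a proofreading tool. I will give you a very simple mistake that you can easily find in any business email or letter, and then we will try to correct it with high accuracy.
The mistake is that when people specify a meeting time in an email, they may write it as 2pm, 2PM, 2P.M., or some other variant, whereas the correct format is 2 p.m. or 9 a.m.
This kind of mistake can be fixed with a pattern-based rule. The rule logic is as follows.
Take a number of one or two digits in the range 1 to 12. If am or pm follows this number directly, without a space or periods, insert the space and the proper periods.
I will implement it with a regular expression.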
\b(0?[1-9]|1[0-2])([ap])m\b

import re

# Hour 1-12 (optional leading zero) immediately followed by am or pm.
# re.IGNORECASE lets the same rule catch 5pm, 5PM and 5Pm.
time_pattern = re.compile(r'\b(0?[1-9]|1[0-2])([ap])m\b', re.IGNORECASE)

inputstring = "Our meeting will be at 5pm tomorrow."
# inputstring = "Our meeting will be scheduled at 11am tomorrow."

if time_pattern.search(inputstring):
    # Rewrite every match as "<hour> a.m." or "<hour> p.m.".
    print(time_pattern.sub(
        lambda m: '{} {}.m.'.format(m.group(1), m.group(2).lower()),
        inputstring))
else:
    print("Not matched...!")

Our meeting will be at 5 p.m. tomorrow.

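This is a basic example, but it helps you think about how proofreading can be done. Many simple rule sets can be applied to the data, and depending on the pattern, you will get the corrected result.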
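The other variants mentioned earlier (2PM, 2P.M., or a form with a space such as 2 pm) can be covered by loosening the pattern a little. The following is only a sketch: `TIME_VARIANTS` and `normalize_times` are names chosen for this example, and the decision to absorb a period that directly follows the match is an assumption about the desired style.

```python
import re

# Hour 1-12, an optional space, then am/pm with optional periods, any case.
TIME_VARIANTS = re.compile(
    r'\b(0?[1-9]|1[0-2])\s?([ap])\.?m\b\.?',
    re.IGNORECASE,
)

def normalize_times(text):
    """Rewrite forms such as 2pm, 2PM, 2P.M. or 2 pm as '2 p.m.'."""
    return TIME_VARIANTS.sub(
        lambda m: "{} {}.m.".format(m.group(1), m.group(2).lower()),
        text,
    )

print(normalize_times("Meet at 2PM or 2P.M. tomorrow"))
print(normalize_times("Call at 11 am"))
```

Text that is already in the correct format passes through unchanged, and hours outside 1 to 12 (for example 13pm) are left alone rather than guessed at.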

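7.6.3 Grammar correction

We will write a simple rule for subject-verb agreement in the simple present tense.
In the simple present tense, a third-person singular subject always takes a singular verb, that is, a verb with the suffix s/es. The following sentences all break this rule:

He drink tomato soup in the morning
She know cooking
We plays game online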

from pycorenlp import StanfordCoreNLP
from nltk.tree import Tree

Start the stanford-corenlp server:

cd /media/zhou/数据/JavaLibraries/stanford-corenlp-full-2018-10-05
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer

nlp = StanfordCoreNLP('http://localhost:9000')
text = 'We know cooking.'

output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos,depparse,parse',
    'outputFormat': 'json'
})
parsetree = output['sentences'][0]['parse']

# Collect the personal pronouns (PRP) and the present-tense verb tags
# (VBP = non-third-person singular, VBZ = third-person singular) in the parse tree.
pronouns = set()
verb_tags = set()
for subtree in Tree.fromstring(parsetree).subtrees():
    if subtree.label() == 'PRP':
        pronouns.update(leaf.lower() for leaf in subtree.leaves())
    elif subtree.label() in ('VBP', 'VBZ'):
        verb_tags.add(subtree.label())

if pronouns & {'we', 'i', 'you', 'they'} and 'VBZ' in verb_tags:
    print("Alert: \nPlease check subject and verb in the sentence.\n"
          "You may have plural subject and singular verb.")
elif pronouns & {'he', 'she', 'it'} and 'VBP' in verb_tags:
    print("Alert: \nPlease check subject and verb in the sentence.\n"
          "You may have singular subject and plural verb.")
else:
    print("You have correct sentence.")

You have correct sentence.
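The agreement rule itself can be tried without a running CoreNLP server. The sketch below assumes the sentence has already been tagged with Penn Treebank tags; the tagged examples are written by hand to stand in for what a tagger might return, and `check_agreement` is a name invented for this illustration.

```python
PLURAL_LIKE = {"we", "i", "you", "they"}   # expect the base verb form (VBP)
SINGULAR_THIRD = {"he", "she", "it"}       # expect the s/es verb form (VBZ)

def check_agreement(tagged_tokens):
    """Apply the agreement rule to a list of (word, tag) pairs."""
    pronouns = {word.lower() for word, tag in tagged_tokens if tag == "PRP"}
    verb_tags = {tag for _, tag in tagged_tokens if tag in ("VBP", "VBZ")}
    if pronouns & PLURAL_LIKE and "VBZ" in verb_tags:
        return "plural subject with singular verb"
    if pronouns & SINGULAR_THIRD and "VBP" in verb_tags:
        return "singular subject with plural verb"
    return "ok"

# Hand-tagged stand-ins for tagger output (an assumption, not real output).
print(check_agreement([("We", "PRP"), ("plays", "VBZ"), ("game", "NN"), ("online", "RB")]))
print(check_agreement([("She", "PRP"), ("know", "VBP"), ("cooking", "NN")]))
print(check_agreement([("We", "PRP"), ("know", "VBP"), ("cooking", "NN")]))
```

Like the rule above, this sketch does not check which pronoun is actually the subject, so a sentence such as "We like it" would be flagged by mistake. Using the dependency parse to pick out the subject would be one way to constrain the rule further.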

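7.6.4 Template-based chatbot application

Here we will see how to build a core engine for a chatbot application that helps loan applicants apply for a loan. We generate the output in JSON format, so any front-end developer can integrate this output into a website.
I use the Flask web framework and provide a web service for every question our chatbot asks.
If you want to save user data, you need to install MongoDB. The installation steps are here: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
conversationengine.py is the core rule engine, containing the handcrafted rules and code.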

import json
from bson import json_util  # ships with pymongo

# defualt_missing_data_error and defualt_error are message strings that are
# assumed to be defined elsewhere in conversationengine.py.

def start_converation_action(humanmessage):
    # The message is lower-cased before matching, so lowercase keywords suffice.
    START_CONV_KEYWORDS = ("hello", "hi")
    START_CONV_RESPONSES = ["Please provide me borrower's full name"]
    text = humanmessage
    if text and text.lower() in START_CONV_KEYWORDS:
        # Greeting recognized: ask for the borrower's full name.
        start_conv_json_obj = json.dumps(
            {'message_human': text, 'message_bot': START_CONV_RESPONSES,
             'suggestion_message': ["Please provide me borrower's full name"],
             'current_form_action': "/hi_chat?msg=",
             'next_form_action': "/asking_borowers_full_name?msg=",
             'previous_form_action': "/welcomemsg_chat",
             'next_field_type': "text",
             'previous_field_type': "button",
             "placeholder_text": "Enter borrower's full name",
             "max_length": "255"},
            sort_keys=True, indent=4,
            separators=(',', ': '), default=json_util.default)
    elif not text:
        # Empty or missing message: ask the user to say Hi.
        start_conv_json_obj = json.dumps(
            {'message_human': text,
             'message_bot': defualt_missing_data_error,
             'suggestion_message': ["Hi"], 'current_form_action': "/hi_chat?msg",
             'next_form_action': "", 'previous_form_action': "/welcomemsg_chat",
             'next_field_type': "", 'previous_field_type': "button",
             "placeholder_text": "Hi"},
            sort_keys=True, indent=4,
            separators=(',', ': '), default=json_util.default)
    else:
        # Anything else is treated as an unrecognized message.
        start_conv_json_obj = json.dumps(
            {'message_human': text,
             'message_bot': defualt_error,
             'suggestion_message': ["Hi"], 'current_form_action': "/hi_chat?msg",
             'next_form_action': "", 'previous_form_action': "/welcomemsg_chat",
             'next_field_type': "", 'previous_field_type': "button",
             "placeholder_text": "Hi"},
            sort_keys=True, indent=4,
            separators=(',', ': '), default=json_util.default)
    return start_conv_json_obj

Here we use a keyword list and a response list to implement the chatbot. I have also customized the JSON schema used to export the conversation. If you come from a web development background, you can write JavaScript that displays this JSON in a GUI on the front end. Now let's look at the web service part:

# app (the Flask application), cs (presumably the conversation engine module)
# and conversation_list_history (a list) are assumed to be set up earlier
# in flaskengin.py.
from flask import Response

@app.route('/')
def hello_world():
    return 'Hello from chat bot Flask...!'

@app.route("/welcomemsg_chat")
def welcomemsg_chat():
    welcome_msg = cs.loan_assistant_welcome_msg()
    conversation_list_history.append(welcome_msg)
    # db_handler = mongo.db.chathistory
    # db_handler.insert({"request_user_id": request_user_id, "conversation": conversation_list_history,
    #                    "time": now_india.strftime(fmt)})
    # db_handler.update({"request_user_id": request_user_id}, {
    #     …ime(fmt)}, "currentDate": {"lastModified": True}}, upsert=True)
    resp = Response(welcome_msg, status=200, mimetype='application/json')
    return resp
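Now, to run the script and see the output, follow these steps:
1. First, run flaskengin.py
2. Go to http://127.0.0.1:5002/, where you will see the message Hello from chat bot Flask...!
3. You can view the chatbot's JSON response at http://127.0.0.1:5002/welcomemsg_chat
4. You will see the following JSON response: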

{
    "current_form_action": "/welcomemsg_chat",
    "message_bot": [
        "Hi, I'm personal loan application assistant.",
        "You can apply for loan with help of mine.",
        "To keep going say Hi to me."
    ],
    "message_human": "",
    "next_field_type": "button",
    "next_form_action": "/hi_chat?msg=",
    "placeholder_text": "Hi",
    "previous_field_type": "",
    "previous_form_action": "",
    "suggestion_message": [
        "Hi"
    ]
}

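5. The response gives our human user a suggestion that tells them what input is expected. Here the JSON attribute is suggestion_message: ["Hi"], so the user will see a button labeled Hi.
6. To move on to the next page or the next question, use the value of next_form_action and put the user's reply after it: msg=USER ARGUMENT
7. For example, on the http://127.0.0.1:5002/welcomemsg_chat page, read the message_bot attribute. It says you need to say Hi to the bot.
8. You can reply like this: http://127.0.0.1:5002/hi_chat?msg=Hi
9. When you open http://0.0.0.0:5002/hi_chat?msg=Hi, the bot asks for your name, which you now need to enter.
10. To enter your name and move on to the next question, check the URL value of the next_form_action attribute again.
11. Here the value is /asking_borowers_full_name?msg=
12. Put your name after the = sign, so the URL becomes /asking_borowers_full_name?msg=Jalaj Thanaki
13. When you open http://0.0.0.0:5002/asking_borowers_full_name?msg=Jalaj%20Thanaki, you can see the next question.
14. To walk through the whole conversation, first run flaskengin.py and then check the following URLs:
http://127.0.0.1:5002/welcomemsg_chat
http://127.0.0.1:5002/hi_chat?msg=Hi
http://127.0.0.1:5002/asking_borowers_full_name?msg=Jalaj%20Thanaki
http://127.0.0.1:5002/asking_borowers_email_id?msg=jalaj@gmail.com
http://127.0.0.1:5002/mobilenumber_asking?msg=9425897412
http://127.0.0.1:5002/loan_chat?msg=100000
http://127.0.0.1:5002/end_chat?msg=Bye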
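Building these URLs by hand is easy to get wrong, especially when the reply contains spaces. A small client-side helper could do it instead. This is a sketch only: `BASE_URL` and `next_url` are names invented here, and the helper assumes the server is the local Flask instance used in the steps above.

```python
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:5002"  # local Flask server from the steps above

def next_url(next_form_action, user_message):
    """Append the URL-encoded user reply to a next_form_action value."""
    return BASE_URL + next_form_action + quote(user_message)

print(next_url("/hi_chat?msg=", "Hi"))
print(next_url("/asking_borowers_full_name?msg=", "Jalaj Thanaki"))
```

`quote` turns the space in the name into `%20`, which matches the encoded URL shown in the list above.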
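Advantages of a template-based chatbot
Easy to implement.

Saves time and cost.

The use cases are understood before development starts, so the user experience will also be good.

This is a pattern-matching approach, so users who mix English with other languages in a conversation still get answers: the chatbot recognizes the keywords given in English, and if those keywords match the chatbot's vocabulary, it can give an answer.
Disadvantages of a template-based chatbot
It cannot handle unseen use cases.

Users have to follow a rigid conversation flow.

Spelling mistakes by users cause problems for the chatbot. In that case, we will use deep learning.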
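Both the mixed-language advantage and the spelling weakness can be seen in a few lines of keyword spotting. This is a simplified sketch, not the book's engine: the `TEMPLATES` table, the fallback text and the `reply` function are all invented for illustration.

```python
import re

# Hypothetical keyword-to-reply templates, invented for this sketch.
TEMPLATES = {
    "hi": "Please provide me borrower's full name",
    "hello": "Please provide me borrower's full name",
    "bye": "Thank you, goodbye.",
}
FALLBACK = "Sorry, I did not understand that."

def reply(message):
    """Return the template of the first known keyword found in the message."""
    for word in re.findall(r"[a-z]+", message.lower()):
        if word in TEMPLATES:
            return TEMPLATES[word]
    return FALLBACK

print(reply("namaste, hello!"))  # English keyword inside a mixed message
print(reply("Helo"))             # a misspelled keyword falls through
```

The first call still finds the English keyword among words it does not know, while the second shows why a single typo is enough to break an exact-match template.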

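7.7 Comparing the rule-based approach with other approaches

The rule-based approach is a very reliable engine that gives your application high accuracy. When you compare the RB approach with ML or deep learning approaches, you will find the following points:

  • The RB approach needs a domain expert, whereas ML and deep learning approaches do not.
  • An RB system does not need a large amount of data, whereas ML and deep learning do.
  • In an RB system you have to find the patterns manually, whereas ML and deep learning techniques find the patterns for you based on the data and the input features.
  • An RB system is often a good way to develop the first cut of the final product, which is why it is still popular in practice.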

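7.8 Advantages of rule-based systems

Availability: the availability of the system to the user is not an issue.
Cost efficiency: the system is cost-efficient and accurate in terms of its end result.
Speed: you can optimize the system because you know all of its parts, so producing output within a few seconds is not a big problem.
Accuracy and low error rate: although coverage of different scenarios is low, any scenario that the RB system does cover is handled with high accuracy. Because the rules are predefined, the error rate is also low.
Reduced risk: we reduce the risk in terms of system accuracy.
Steady response: the output depends on the rules, so the response is stable and cannot be vague.
Same cognitive process as a human: the system gives you the same result a human would, because the rules are handcrafted by humans.
Modularity: the modularity and sound architecture of an RB system help the technical team maintain it easily, which reduces human effort and time.
Uniformity: an RB system is very uniform in both its implementation and its output. This makes life easier for end users, because the output is easy for humans to understand.
Easy to implement: the approach mimics the human thought process, so implementing the rules is comparatively easy for developers.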

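7.9 Disadvantages of rule-based systems

A lot of manual work: an RB system demands deep knowledge of the domain as well as a lot of manual work.

Time-consuming: generating rules for a complex system is difficult and takes a long time.

Limited learning capacity: the system generates results according to the rules, so its own capacity to learn is much smaller.

Complex domains: if the application you want to build is too complex, building the RB system can take a great deal of time and analysis. Complex pattern identification is a challenging task in the RB approach.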

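7.10 Challenges for rule-based systems

It is not easy to mimic human behavior.

Selecting or designing the architecture is a critical part of an RB system.

To develop an RB system, you need an expert in the specific domain who generates the rules for you. For NLP, we need linguists who know how to analyze language.

Natural language is a challenging domain in itself because it has so many exceptions, and covering those exceptions with rules is also a challenging task, especially when you have a large number of rules.

Arabic, Gujarati, Hindi, and Urdu are difficult to implement in an RB system because finding domain experts for these languages is hard. There are also fewer tools available for implementing rules for these languages.

The amount of human time and effort required is too high.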

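7.11 Basics of word sense disambiguation

Word sense disambiguation (WSD) is a well-known problem in NLP. First, let's understand what WSD is. When a word in a sentence has more than one meaning, WSD is used to identify the sense in which the word is used. It is hard for a machine to identify the correct sense of such a word, and to solve this challenging problem we can use a rule-based system or machine learning techniques. Whatever the language, solving WSD requires a large amount of data in which you can find instances of words whose meaning differs from sentence to sentence.

Once you have this kind of dataset, human experts come into the picture.

Human experts tag the meaning of a word or words, usually with predefined IDs. Let's take an example. I have two sentences: I went to the river bank, and I went to the bank to deposit money. In these sentences the word bank has more than one meaning, and the meaning changes with the overall sentence. So the human expert tags these words. Here, our word is bank.

In the first sentence, the human expert tags the word bank as a river bank by using a predefined ID. Assume for now that the ID is 100.

In the second sentence, the word bank is tagged as a financial institution by using a predefined ID. Assume for now that the ID is 101.

Once the tagging is done, the next stage begins: choosing either a rule-based engine or a supervised machine learning technique.

If we decide to go with the rule-based system, the human experts need to come up with one or more specific patterns or rules that help us disambiguate the senses of the words.

Sometimes, for some words, the experts can find rules by using parsing results or POS tags, but in most cases they cannot.

So nowadays, once the tagging is done, the tagged data is used as input to develop a supervised machine learning model that helps identify the senses of the words.

Sometimes a rule-based system alone does not work, and sometimes a machine learning approach alone does not help either. In my experience this is one of those cases, and I think a hybrid approach will give you a better result.

After tagging the data, we should build an RB system that works well for the known situations. There will also be situations for which we cannot define rules, and to handle those we need to build a machine learning model.

You can also use vectorization concepts and deep learning models to solve the WSD problem. Applying deep learning to WSD can also be a research topic in its own right.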
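The rule-based half of this idea can be sketched for the word bank, reusing the sense IDs 100 and 101 from the example. The cue words in `SENSE_RULES` are assumptions picked by hand for this illustration, not rules from the book; a real system would get them from a linguist or from the tagged data.

```python
RIVER_BANK, FINANCIAL_BANK = 100, 101   # sense IDs from the example above

# Hypothetical context cues chosen by hand for this sketch.
SENSE_RULES = {
    RIVER_BANK: {"river", "water", "shore", "fishing"},
    FINANCIAL_BANK: {"deposit", "money", "loan", "account"},
}

def disambiguate_bank(sentence):
    """Return the sense ID whose cue words appear in the sentence, else None."""
    words = set(sentence.lower().replace(".", " ").split())
    for sense_id, cues in SENSE_RULES.items():
        if words & cues:
            return sense_id
    return None  # no rule fired; a supervised model could take over here

print(disambiguate_bank("I went to the river bank."))
print(disambiguate_bank("I went to the bank to deposit money."))
print(disambiguate_bank("The bank was closed."))
```

Returning `None` when no rule fires marks exactly the cases that the hybrid approach would hand over to a machine learning model.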

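7.12 Recent trends in rule-based systems

This section discusses how the current market is using RB systems. Many people ask questions on different forums because they want to know about the future of RB systems, so I want to discuss some important questions with you that will help you understand the future trend of the NLP market and RB systems. I have a couple of questions to look at.

Are RB systems outdated in the NLP domain? I want to answer this question with a NO. RB systems are used in all the major NLP applications: grammar correction, speech recognition, machine translation, and so on. This approach is the first step when you start building any new NLP application. If you want to experiment with an idea, a prototype can easily be developed with the help of the RB approach. For prototyping you need domain knowledge and basic coding skills; you do not need to know advanced mathematics or ML techniques. For basic prototyping, you should go with an RB system.

Can deep learning and ML-based approaches replace RB-based systems? This is a very open question. I would like to put some facts in front of you that will help you come up with your own answer. Nowadays we have a huge amount of data and cheap computational power, and the AI industry and AI-based projects are attracting a lot of attention. The first two points help deep learning and ML approaches achieve accurate results for NLP and other AI applications, and these approaches need less human effort than RB systems. This is why so many people think that RB systems will be replaced by deep learning and ML-based systems. I think RB systems will not be replaced entirely; instead, they will complement those approaches. How? My answer is a hybrid approach, which is much more beneficial for us. We can find patterns or predictions with the help of an ML system, and then pass those predictions to an RB system, which validates them and chooses the best one for the user. This actually helps us overcome one major challenge of RB systems: the amount of human effort and time they require.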
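The hybrid idea, in which an ML system proposes and an RB system validates, can be sketched as follows. Everything here is illustrative: the scores are made up rather than produced by a real model, and `choose_prediction` and `no_finance_near_river` are names invented for this example.

```python
def choose_prediction(predictions, rules, context):
    """Pick the highest-scoring prediction that every rule accepts.

    predictions: list of (label, score) pairs from some ML model.
    rules: functions taking (label, context) and returning True to accept.
    """
    valid = [(label, score) for label, score in predictions
             if all(rule(label, context) for rule in rules)]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[1])[0]

# Hypothetical handcrafted rule: the financial sense is implausible
# when the sentence mentions a river.
def no_finance_near_river(label, context):
    return not (label == "financial" and "river" in context.lower())

scores = [("financial", 0.6), ("river", 0.4)]   # made-up model output
print(choose_prediction(scores, [no_finance_near_river],
                        "I sat on the river bank"))
```

The model's top guess is rejected by the rule, so the validated answer is the lower-scoring sense. The rules stay few and cheap to write because they only have to veto bad predictions, not generate answers.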

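There is no right or wrong answer to the preceding questions. It all depends on how you see the problem and the NLP domain. I just want to leave you with a thought: think about it, and come up with your own answer.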

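7.13 Summary

In this chapter, we have seen all the details related to rule-based systems and how the rule-based approach helps us develop rapid prototypes for complex problems with high accuracy. We have seen the architecture of a rule-based system, and we have learned about its advantages, disadvantages, and challenges. We have seen how this kind of system helps us develop NLP applications such as grammar correction systems and chatbots. We have also discussed the recent trends in rule-based systems.

In the next chapter, we will learn the other main approach to solving NLP applications, called machine learning. The next chapter covers in detail the machine learning algorithms you need for developing NLP applications. We will look at supervised, semi-supervised, and unsupervised ML techniques. We will also develop some applications from scratch. So keep reading!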


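Acknowledgments
Python Natural Language Processing (《Python自然语言处理》) [1][2][3], by Jalaj Thanaki (India), is a new and highly practical book. To understand its content more deeply, I extended and practiced parts of it and share the results here, hoping they will be helpful. You are welcome to add me on WeChat (verification: NLP) to study and discuss together. Corrections are welcome.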

References


  1. https://github.com/jalajthanaki

  2. Jalaj Thanaki, 《Python自然语言处理》 (Python Natural Language Processing), translated by Zhang Jinchao, Liu Shuman, et al., China Machine Press, 2018

  3. Jalaj Thanaki, Python Natural Language Processing, 2017


Reposted from blog.csdn.net/weixin_43935926/article/details/86736712