Java fresh-food e-commerce platform: ElasticSearch architecture design for massive e-commerce search, with source code analysis


 

Features of a fresh-food e-commerce search engine

As is well known, a standard search engine has three major parts: first the crawler system, second data analysis, and third serving search results. The first difference is that an e-commerce search engine has no crawler system, because all of its data is structured and usually lives in a Microsoft or Oracle database, so it does not need to keep crawling elsewhere for content the way Baidu does. In practice an e-commerce company does have a "crawler" of its own, but it is usually used to crawl competitors' prices so that the company can adjust its own.

The second difference is that, in an e-commerce search engine, filtering is used more often than search and can matter even more than search itself. What is filtering? When we shop on a site and search for a keyword such as "diapers", all the related brands and other classifications are presented so that we can narrow the results. Baidu, by contrast, simply searches for the words you type; for news it may at most offer a filter by time.
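As a rough illustration of the filtering described above, the sketch below counts how many matched products fall under each brand and then narrows the hits to one brand. The `Product` shape, the brand names, and the method names are all hypothetical; a production engine would compute these counts inside the index rather than over an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FacetCounter {
    /** Hypothetical product shape: only the fields this sketch needs. */
    static final class Product {
        final String name;
        final String brand;

        Product(String name, String brand) {
            this.name = name;
            this.brand = brand;
        }
    }

    /** Counts the matched products per brand, i.e. the numbers shown in a filter bar. */
    static Map<String, Integer> brandCounts(List<Product> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Product p : hits) {
            counts.merge(p.brand, 1, Integer::sum);
        }
        return counts;
    }

    /** Keeps only the hits of the brand the user selected. */
    static List<Product> filterByBrand(List<Product> hits, String brand) {
        List<Product> kept = new ArrayList<>();
        for (Product p : hits) {
            if (p.brand.equals(brand)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Product> hits = Arrays.asList(
                new Product("diapers S", "BrandA"),
                new Product("diapers M", "BrandA"),
                new Product("diapers L", "BrandB"));
        System.out.println(brandCounts(hits));
        System.out.println(filterByBrand(hits, "BrandB").size());
    }
}
```

The counts and the filtered list are derived from the same hit set, which is why the filter options always match what the keyword search returned.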

Third, an e-commerce search engine supports sorting along many dimensions, including positive-review rate, sales, number of reviews, and price. Its real-time requirements are also very high. A general search engine such as Baidu indexes only very important sites, such as heavyweight portals, quickly; sites with very little traffic may be crawled only once a month. In e-commerce search, the real-time requirement shows up mainly in two fields: price and inventory.
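Sorting by several dimensions can be sketched with one comparator per sort key. The `Item` fields and the `SortBy` names below are assumptions made for illustration, not part of any real platform's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ProductSorter {
    /** Hypothetical search hit carrying the sortable dimensions. */
    static final class Item {
        final String name;
        final double price;
        final int sales;
        final double positiveRate;

        Item(String name, double price, int sales, double positiveRate) {
            this.name = name;
            this.price = price;
            this.sales = sales;
            this.positiveRate = positiveRate;
        }
    }

    enum SortBy { PRICE_ASC, SALES_DESC, POSITIVE_RATE_DESC }

    /** Maps the sort key chosen by the user to a comparator. */
    static Comparator<Item> comparatorFor(SortBy key) {
        switch (key) {
            case PRICE_ASC:
                return Comparator.comparingDouble((Item i) -> i.price);
            case SALES_DESC:
                return Comparator.comparingInt((Item i) -> i.sales).reversed();
            case POSITIVE_RATE_DESC:
                return Comparator.comparingDouble((Item i) -> i.positiveRate).reversed();
            default:
                throw new IllegalArgumentException("unknown sort key: " + key);
        }
    }

    /** Returns a sorted copy so the original hit list stays untouched. */
    static List<Item> sort(List<Item> hits, SortBy key) {
        List<Item> copy = new ArrayList<>(hits);
        copy.sort(comparatorFor(key));
        return copy;
    }

    public static void main(String[] args) {
        List<Item> hits = Arrays.asList(
                new Item("a", 5.0, 10, 0.90),
                new Item("b", 3.0, 50, 0.80),
                new Item("c", 9.0, 20, 0.99));
        for (Item i : sort(hits, SortBy.SALES_DESC)) {
            System.out.println(i.name + " sales=" + i.sales);
        }
    }
}
```

Because price and inventory change constantly, the values being sorted on would have to be refreshed in near real time; this sketch sorts whatever snapshot it is given.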

Another requirement is that an e-commerce search engine must not lose products. Suppose we open a shop on Taobao or Tmall and finally run a promotion, but the products cannot be found by search: that is intolerable. In addition, e-commerce search engines are merging with recommendation systems and advertising systems, because search contributes the most traffic, so everyone wants to integrate it with their advertising system. Finally, and very importantly, high availability must be guaranteed: the system cannot go down.

E-commerce search engine architecture

Because an e-commerce search engine differs so much from a general search engine, its architecture is also distinctive. Search engines are implemented in many ways: there are very large companies such as Google, Baidu, and Sogou; there are e-commerce search engines such as those of Jingdong, Taobao, and Dangdang; and many small and medium-sized e-commerce companies prefer to use an open source search engine. Overall, the options include the following:

 
[Figure: architecture design and performance optimization of an e-commerce search engine]

The first is "Lucene plus your own wrapper": Lucene is used only for retrieval, and you build the rest around it. The other two options, Solr and ES, are both complete solutions that also cover indexing and everything else; you only need to deploy your business logic and then query for results.

The second is Solr, a high-performance full-text search server developed in Java 5 and based on Lucene. It extends Lucene with a richer query language, is configurable and scalable, optimizes query performance, and provides a full-featured management interface. It is a very good full-text search engine.

The third is ElasticSearch, a search server based on Lucene. It provides a distributed, multi-tenant full-text search engine with a RESTful web interface. Elasticsearch is written in Java, was released as open source under the terms of the Apache License, and is very widely used today.
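To make the RESTful interface concrete, the sketch below builds the JSON body of a search request that combines a keyword match, a brand filter, and a sort, which is the kind of body sent to the `_search` endpoint of an index. The field names `name` and `brand`, the sort field, and the class name are assumptions for illustration; the string building is deliberately simple and escapes only backslashes and quotes.

```java
final class EsQueryBuilder {
    /** Escapes backslashes and double quotes; control characters are not handled here. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Builds a bool query: "must" scores the keyword match, "filter" narrows
     * by brand without affecting the score, and "sort" orders the hits.
     */
    static String searchBody(String keyword, String brand, String sortField) {
        return "{"
                + "\"query\":{\"bool\":{"
                + "\"must\":[{\"match\":{\"name\":\"" + escape(keyword) + "\"}}],"
                + "\"filter\":[{\"term\":{\"brand\":\"" + escape(brand) + "\"}}]"
                + "}},"
                + "\"sort\":[{\"" + escape(sortField) + "\":{\"order\":\"desc\"}}]"
                + "}";
    }

    public static void main(String[] args) {
        // The body would be POSTed to an address such as /products/_search (hypothetical index name).
        System.out.println(searchBody("diapers", "BrandA", "sales"));
    }
}
```

In real code a JSON library or the official client would normally build this body; plain strings are used here only to keep the example free of dependencies.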

It is worth mentioning that Dangdang's search engine is its own implementation. Today, emerging Internet companies mostly use the first or second option, while those with larger data volumes commonly use the third.

Standard modules of an e-commerce search engine

 
[Figure: architecture design and performance optimization of an e-commerce search engine]

Next, I want to talk about which features we need to implement if we build our own search engine (the figure shows the standard modules of an e-commerce search engine). In fact this applies not only to e-commerce: apart from general web search engines, other search engines follow the same standard layout.

 

The retrieval module starts by analyzing the user's intent, which is done purely algorithmically from the user's search terms. For example, if the search term is "black bag", the user's intent is to buy a black bag, but in Chinese the character for "bag" also combines with other characters, so "steamed buns" may even appear in the results. The query analysis system therefore has to tell the retrieval system to look mainly in the clothing and shoes category rather than in fresh food.
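One very simple way to picture this query analysis step is a table of term-to-category weights: each term in the query votes for categories, and the retrieval system is pointed at the category with the highest total. The weights, category names, and class below are invented for illustration; the source does not describe the actual algorithm.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class QueryIntent {
    // term -> (category -> weight). Hand-filled here; a real system would learn these offline.
    private final Map<String, Map<String, Double>> weights = new HashMap<>();

    void addWeight(String term, String category, double weight) {
        weights.computeIfAbsent(term, t -> new HashMap<>()).put(category, weight);
    }

    /** Sums the category weights of every known term and returns the best category, if any. */
    Optional<String> predictCategory(String query) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : query.toLowerCase().trim().split("\\s+")) {
            Map<String, Double> perCategory = weights.get(term);
            if (perCategory == null) {
                continue; // unknown terms carry no category signal
            }
            for (Map.Entry<String, Double> e : perCategory.entrySet()) {
                scores.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        QueryIntent intent = new QueryIntent();
        intent.addWeight("bag", "clothing-and-shoes", 0.9);
        intent.addWeight("bag", "fresh-food", 0.1);
        System.out.println(intent.predictCategory("black bag"));
    }
}
```

An empty result means the query gave no category signal, in which case the retrieval system would search all categories.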

At the technical level, Dangdang uses C++. When a system needs good performance, the somewhat older companies use C++ or C. It is not just Dangdang; many companies in fact implement their search engines in C or C++.

Data update module

 

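The second module is the data update module, which is responsible for generating the index. The main job of the data center module is to turn the raw structured data into a search database that the retrieval system can use. Should the data update module and the retrieval module be separate or merged? In essence both are just code and could perfectly well be written in one process. They can also be separated and communicate over the network; each approach has its merits. The first is the simple, brute-force approach: for an ordinary e-commerce site such as a fresh-food platform, where the data volume is small and the real-time and seasonal requirements are strong, both systems can run in a single process. But at the scale of millions, tens of millions, or even hundreds of millions of items, the system can no longer be deployed on a single machine.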
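The core of turning structured rows into something searchable is an inverted index, which maps each term to the products that contain it. The sketch below is a toy in-memory version that splits titles on whitespace; the class and method names are hypothetical, and a real system would rely on Lucene or ElasticSearch for tokenization and storage.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class InvertedIndex {
    // term -> sorted ids of the products whose title contains the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Data update side: index one structured row (id plus title). */
    void add(int productId, String title) {
        for (String term : title.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(productId);
            }
        }
    }

    /** Retrieval side: ids of the products containing every query term. */
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : query.toLowerCase().trim().split("\\s+")) {
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                return Collections.emptyList(); // one missing term empties an AND query
            }
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        if (result == null) {
            return Collections.emptyList();
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "fresh red apple");
        index.add(2, "fresh green apple");
        index.add(3, "red wine");
        System.out.println(index.search("red apple"));
    }
}
```

Here `add` plays the role of the data update module and `search` the retrieval module; living in one class corresponds to the single-process option described above.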

 
[Figure: architecture design and performance optimization of an e-commerce search engine]

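The figure above shows the two systems merged together. The red part is the retrieval system and the yellow part is the upstream system that produces data. For Taobao, the upstream is Taobao's merchants; for Dangdang, it is the marketing department staff. They enter data into the system, it is pushed to the database and passed downstream, and finally an index is built.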

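The blue part in the figure above is the business logic. E-commerce search has a very large volume of business requirements, especially now that everyone likes to shop on their phones. A mobile-exclusive price, for example, is a new business feature, which means a dedicated module is needed to handle this kind of commercial logic.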

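In addition there is user behavior analysis. The logs we collect and other related data are all stored on a Hadoop cluster, processed by offline computation, and then passed to the business module or the ranking module for sorting and scoring, to give users a better experience.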

Problems are inevitable! How do we solve them?

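Although the design looks very reasonable overall, problems still occur. In general, the problems of a mature e-commerce search system are concentrated in a few areas. The first is bugs, which of course every system runs into. The second is concurrency: a search system cannot be split into sharded databases and tables, so what we can do is split the index. The last is monitoring, including issue tracking, the logging system, and the monitoring system. So what should we do to address these problems?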

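First, bugs can only be handled through automated operations (the OneAPM tool is also recommended here). Second, high concurrency is currently handled mainly with caching and horizontal scaling, and how those two are applied to the system is critical. Many people also suggest switching languages, for example from Python to C++, but in practice switching languages does not solve the concurrency problem; a well-designed data structure improves performance more than a change of language. So the usual answers to high concurrency remain caching and horizontal scaling.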
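Index splitting and horizontal scaling can be sketched as follows: each product is routed to one shard by its id, and a query is sent to every shard and the partial results are merged. This is a minimal single-process illustration under assumed names; in a real deployment each shard would live on its own machine and the fan-out would happen over the network in parallel.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ShardedIndex {
    // One term -> product ids map per shard.
    private final List<Map<String, List<Integer>>> shards = new ArrayList<>();

    ShardedIndex(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashMap<>());
        }
    }

    /** Routes a product to a shard; floorMod keeps the result non-negative. */
    int shardFor(int productId) {
        return Math.floorMod(productId, shards.size());
    }

    void add(int productId, String term) {
        shards.get(shardFor(productId))
                .computeIfAbsent(term, t -> new ArrayList<>())
                .add(productId);
    }

    /** Scatter-gather: ask every shard, then merge and sort the partial results. */
    List<Integer> search(String term) {
        List<Integer> merged = new ArrayList<>();
        for (Map<String, List<Integer>> shard : shards) {
            merged.addAll(shard.getOrDefault(term, Collections.emptyList()));
        }
        Collections.sort(merged);
        return merged;
    }

    public static void main(String[] args) {
        ShardedIndex index = new ShardedIndex(3);
        for (int id = 1; id <= 6; id++) {
            index.add(id, "apple");
        }
        System.out.println(index.search("apple"));
    }
}
```

Adding machines then means adding shards, so each one holds a smaller slice of the index; the cost is that every query touches all shards.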

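Third, use the Flume log system. (Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of logs. Flume lets you customize various data senders in the logging system to collect data, and it can also do simple processing on the data and write it to various customizable data receivers.) Flume gathers the logs from every node in the cluster, which has two benefits. First, when something goes wrong in production, you can roll back the bug first and then query the logs to investigate. Second, the collected logs support user behavior analysis, such as how many times users clicked and where the traffic came from, which helps improve ranking.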

 

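Next, caching. Search caching is generally divided into two levels. From what I have observed, Sogou probably uses page-level caching, while Baidu probably uses index-level caching. For example, searching a term on Sogou may take 40 ms the first time and then drop straight to 1 ms when searched again; that is page-level caching. On Baidu the first search may take 40 ms and the second 25 ms; it does not cache the page but caches the posting lists of the inverted index, which is a different level.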

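Many e-commerce search systems use a two-level cache. For especially hot terms we can use a page-level cache, with a lifetime of only 15 to 20 seconds. Things like price, however, cannot be cached; the front-end page has to pull the price separately. The second level is the index-level cache, which in practice is also a self-built cache system. Ranking can be cached as well, because ranking results do not change much.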
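The page-level cache with a short lifetime can be sketched as a map from query to rendered page plus an expiry time. The class name, the injected clock, and the renderer function are assumptions made so the example is easy to test; the cached page is assumed to contain no prices, since those are fetched separately.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

final class PageCache {
    private static final class Entry {
        final String page;
        final long expiresAtMillis;

        Entry(String page, long expiresAtMillis) {
            this.page = page;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so tests can control time

    PageCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached page while it is fresh; otherwise renders and stores a new one. */
    String get(String query, Function<String, String> renderer) {
        long now = clock.getAsLong();
        Entry cached = entries.get(query);
        if (cached != null && now < cached.expiresAtMillis) {
            return cached.page;
        }
        String page = renderer.apply(query); // full search only on a miss or after expiry
        entries.put(query, new Entry(page, now + ttlMillis));
        return page;
    }

    public static void main(String[] args) {
        PageCache cache = new PageCache(15_000L, System::currentTimeMillis);
        System.out.println(cache.get("diapers", q -> "results for " + q));
        System.out.println(cache.get("diapers", q -> "not rendered, served from cache"));
    }
}
```

Two concurrent misses for the same query may both render the page in this sketch; with a lifetime of seconds that duplication is usually acceptable, but a stricter design would render once per key.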

 
[Figure: architecture design and performance optimization of an e-commerce search engine]

The figure above shows Dangdang's search architecture. One of the clusters does data analysis and is loaded full of data.

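First, how do the clusters communicate? We mainly use ZMQ. (It is a simple, easy-to-use transport layer: a socket library that works like a framework and makes socket programming simpler, more concise, and faster. It is a message queue library that scales elastically across threads, cores, and hosts.) There is really only one reason: it is fast, very fast, which suits workloads with large data volumes.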

How do we avoid cold starts?

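Finally there is the cold start problem, a headache for many e-commerce sites, especially once the number of products reaches a certain scale, say hundreds of millions; Taobao and Tmall presumably have even more. If the index has been rebuilt and must be loaded, or a new business module has gone live and the system must be restarted, it is very troublesome.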

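Of course, once the cluster is large there are many approaches, such as starting nodes in separate batches. As for the technique, index loading generally uses the standard Linux mmap. (mmap maps a file or other object into memory. The file is mapped onto multiple pages; if the file size is not a whole multiple of the page size, the unused space in the last page is zeroed. mmap plays a major role among the system calls for user-space mappings.) Startup is then very fast, but the system has a warm-up period, and queries in the first stretch of time are slower.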

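If the data volume is not especially large, and with memory as cheap as it is now, the data can simply be read into memory in one go, since access through mmap is not as fast as access to data already in memory.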
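The two loading strategies can be compared in Java, where `FileChannel.map` is the counterpart of mmap. This is a sketch under assumptions: the class and method names are hypothetical, the `preload` flag uses `MappedByteBuffer.load()` as a best-effort way to shorten the warm-up period, and a single mapping is limited to about 2 GB, so a large index would need several mappings.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class IndexLoader {
    /**
     * Maps the index file into memory. Startup is fast because pages are
     * read lazily; with preload the JVM makes a best effort to fault the
     * pages in up front, trading startup time for faster first queries.
     */
    static MappedByteBuffer mapIndex(Path file, boolean preload) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            if (preload) {
                buffer.load();
            }
            return buffer; // the mapping stays valid after the channel is closed
        }
    }

    /** Reads the whole index onto the heap: slower startup, no warm-up afterwards. */
    static byte[] readFully(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("index", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4});
        System.out.println(mapIndex(tmp, true).remaining());
        System.out.println(readFully(tmp).length);
    }
}
```

Which variant is better depends on index size and available memory; the article's point is only that a small index can skip mmap entirely.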

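Third, reduce the frequency of full data rebuilds as much as possible to avoid restarting the whole system. This requires optimizing the index periodically and clearing out index entries that are no longer useful.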

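If a new business module requires a cluster restart, that is something that should not happen; it means the architecture has a problem. The right approach is to release business modules as external modules or plug-ins. Otherwise every new module means a cluster restart, which nobody can tolerate.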

 


Origin www.cnblogs.com/jurendage/p/11328959.html