The Road of the Architect: An Introduction to the Search Business, Its Technology, and Fault-Tolerance Mechanisms

Today I worked with the search department on an MQ migration, and we took the opportunity to exchange notes on business and technology. I found that the post-90s engineers are really quite good now, both in ability and in curiosity. A pity my family isn't recruiting a son-in-law.

  As mentioned in the previous article, we have a media library (LeTV's own video and audio content) and a whole-network works library (external video and audio content), with data volumes in the tens of millions. Our UV, PV, CV, and VV figures are all confidential, so as a qualified employee... I don't know the values. In short, these libraries are the final data source, and it takes a workflow spanning multiple departments before results appear in the box the user sees after clicking the search button. The general flow chart is as follows:

  As for why this flowchart is not hand-drawn like the previous ones: well, my pen is at the office.

  Apart from the two libraries on our side, each of the other boxes is a separate department. We use the offline service I developed to deliver data to the pipeline. The pipeline deduplicates and merges data from the various sources. That is, some videos have the same content but different sources, or have similar but not identical names while actually being one video. For example, in college I watched a movie called "A Cinderella Story"; its Chinese title was rendered by some sources as "Cinderella Story" and by others as "Cinderella's Glass Phone", but from the director and cast list you can tell it is the same film. Such identical videos are aggregated into an album: the best description is chosen as the album's description, and expanding the details shows a sorted list of the videos from each source.

  

  A normal whole-network search will also rank our own videos first:

  You may have guessed how this merging is done: the parallel computation uses MapReduce, because it is a combination operation over video IDs and the order of magnitude is quite large. The data returned by the search engine are all IDs; the real data come back from the details department through a proxy. What does it mean to be a proxy? It is transparent to the caller: if anything changes, only the proxy side needs to be modified.
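The dedup rule described above ("same director and cast list means the same film") can be sketched as a MapReduce-style grouping: the map step emits a key built from director plus sorted cast, and the reduce step collects videos sharing a key into one album. The `Video` record and its field names are illustrative, not the real schema; this is a minimal single-process sketch of the idea, not the actual pipeline.

```java
import java.util.*;
import java.util.stream.*;

public class AlbumMerge {
    // Hypothetical video record; the fields are illustrative, not the real schema.
    record Video(String source, String title, String director, List<String> cast) {}

    // Map step: build a merge key from director + sorted cast, mirroring the
    // "same director and cast list => same film" rule.
    static String mergeKey(Video v) {
        List<String> sorted = new ArrayList<>(v.cast());
        Collections.sort(sorted);
        return v.director() + "|" + String.join(",", sorted);
    }

    // Reduce step: group videos sharing a key into one "album".
    static Map<String, List<Video>> mergeIntoAlbums(List<Video> videos) {
        return videos.stream().collect(Collectors.groupingBy(AlbumMerge::mergeKey));
    }

    public static void main(String[] args) {
        List<Video> videos = List.of(
            new Video("siteA", "A Cinderella Story", "Mark Rosman",
                      List.of("Hilary Duff", "Chad Michael Murray")),
            new Video("siteB", "Cinderella's Glass Phone", "Mark Rosman",
                      List.of("Chad Michael Murray", "Hilary Duff")),
            new Video("siteC", "Some Other Film", "Jane Doe", List.of("Actor X")));
        // The two Cinderella entries share a key and merge into one album.
        System.out.println(mergeIntoAlbums(videos).size()); // 2
    }
}
```

Sorting the cast before joining makes the key order-insensitive, so sources that list the actors differently still land in the same album.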

  The search engine here belongs to an independent department I had never dealt with before. At my previous company, however, my team was responsible for the company-wide vertical search. The data sources were all relational databases and the data volume was not large; the search engine was Solr, with the open-source IK tokenizer. Because the company was building a high-end networking product for entrepreneurs, search had many special requirements for ranking results such as people's names and company names, so I studied the tokenizer's source code and modified it to suit the project. For example, one requirement was to filter HTML tags out of the input. But the first thing Solr does after reading a document is segmentation, using its own or an external tokenizer, and only then does it apply finer-grained filters or synonym handling. That first step destroys the document's structure, turning an HTML document into bare words, so removing tags afterwards is cumbersome and inefficient. What I did instead was modify the IK tokenizer's source so that the very first read operation filters out the tags (Apache ships ready-made utility classes for this), avoiding the cost of reading the markup in and stripping it back out later. I benchmarked it at the time: over 100,000 loop iterations, the modified tokenizer was roughly 20x faster than stripping tags with a regular expression in a Solr filter. I also realized that Solr and ES, both built on Lucene, are better suited to Western-language search. Chinese search does not need a language processor that lowercases text and then stems words algorithmically or by matching, as Western languages do; it needs dictionary-driven processing, including synonyms and near-synonyms. So there is also a lot of room for optimization in word segmentation.
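The idea of stripping tags before segmentation, rather than after, can be illustrated with a toy pipeline. This is only a sketch: the regex-based stripper stands in for the Apache utility classes mentioned above, and the whitespace split stands in for the IK tokenizer; a real modification would filter inside the Reader that feeds the lexer.

```java
public class TagStrippingTokenizer {
    // Strip HTML tags *before* segmentation, so the tokenizer never sees markup.
    // A naive regex is used here purely as an illustration of the ordering.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Stand-in for the real tokenizer: a plain whitespace split.
    static String[] tokenize(String text) {
        return stripTags(text).split(" ");
    }

    public static void main(String[] args) {
        String doc = "<p>LeTV <b>search</b> engine</p>";
        System.out.println(String.join("|", tokenize(doc))); // LeTV|search|engine
    }
}
```

Doing the stripping first means the downstream filter chain only ever sees clean terms, which is exactly why it avoids the read-in/read-out overhead of filtering after segmentation.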

  Personally, I think analyzing search data matters a great deal. For example, log analysis showed that when a user enters the search keyword "Jia Yueting", he is very likely also interested in results containing the keyword "LeTV". After discovering this, I built a thesaurus for such pairs and bound the words bidirectionally at both search time and index time, which amounts to a synonym function. Try searching the title of this article in several search engines and comparing the results; it is quite interesting.
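The "bidirectional binding" above amounts to a symmetric synonym table: binding either term makes a query for one also match the other. A minimal sketch, with the "Jia Yueting" / "LeTV" pair from the log-analysis example as sample data:

```java
import java.util.*;

public class SynonymBinder {
    // Symmetric synonym table: bind(a, b) registers both directions.
    private final Map<String, Set<String>> table = new HashMap<>();

    void bind(String a, String b) {
        table.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        table.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Expand a query term into itself plus every bound synonym; applying this
    // at both index time and query time gives the two-way behavior.
    Set<String> expand(String term) {
        Set<String> out = new HashSet<>(table.getOrDefault(term, Set.of()));
        out.add(term);
        return out;
    }

    public static void main(String[] args) {
        SynonymBinder binder = new SynonymBinder();
        binder.bind("Jia Yueting", "LeTV");
        System.out.println(binder.expand("Jia Yueting").contains("LeTV"));   // true
        System.out.println(binder.expand("LeTV").contains("Jia Yueting"));   // true
    }
}
```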

  The details data are also stored in a document database. MongoDB would actually be a good fit, but the company has a unified cbase cluster, so we just put them in cbase. I often have to spend a long time explaining how cbase, Couchbase, and memcached relate to each other. Everyone knows memcached, but memcached does not support persistence: with a plain memcached cluster there is no fault tolerance at all when a node fails, and any countermeasure is left to the user. Hence an enhanced memcached cluster was born: Couchbase. The data layer interacts with data through the memcached API; the system embeds a persistence engine into the memcached process to cache, replicate, and persist data, synchronizing it to CouchDB via an asynchronous queue. Because data are automatically replicated across multiple nodes, a single node failure does not affect the business; it supports automatic sharding, which makes online cluster maintenance easy. So what is cbase? It is our company's fork of Couchbase; the main change is forcibly enlarging the maximum value size: memcached's limit is 1 MB, and we raised it to 4 MB. But use large values with caution. Performance is good when values are evenly distributed between 1 KB and just under 1 MB and actual capacity usage stays below 50%; if there are many large values and those conditions cannot be met, performance drops sharply.

 

  Back in 2008 and 2009, well-known foreign Internet companies such as Facebook and mixi began using memcached to reduce database accesses, speed up dynamic pages, and improve scalability. At Renren, which benchmarked itself against Facebook, the technology quickly spread through the internal departments. A memcached cluster uses a distributed design in which the servers do not communicate with each other at all; distribution is handled by an algorithm on the client side. This is what is meant by "no fault tolerance when a node fails."
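The "distributed algorithm on the client side" is the key point: only the client decides which server owns a key, so no server coordination exists to mask a failed node. A minimal consistent-hash ring sketch of that client-side routing follows; the node names are made up, and real memcached clients such as spymemcached use the more elaborate ketama hashing rather than `hashCode()`.

```java
import java.util.*;

public class ConsistentHashRing {
    // Client-side routing: servers never talk to each other; the client alone
    // maps each key to a node. Virtual nodes smooth out the distribution.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    ConsistentHashRing(List<String> nodes, int virtualNodes) {
        for (String node : nodes)
            for (int i = 0; i < virtualNodes; i++)
                ring.put((node + "#" + i).hashCode(), node);
    }

    // A key routes to the first virtual node clockwise from its hash.
    String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(key.hashCode());
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing ring =
            new ConsistentHashRing(List.of("cache-1", "cache-2", "cache-3"), 100);
        // The same key always routes to the same node, with no server-side help.
        System.out.println(ring.nodeFor("user:42").equals(ring.nodeFor("user:42"))); // true
    }
}
```

Because routing lives entirely in the client, a dead node simply makes its share of keys unreachable; nothing in the cluster reroutes or recovers them, which is exactly the gap Couchbase fills.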

  Let me introduce a concept here: common fault-tolerance mechanisms. As far as I know, there are mainly six.

  ☆ failover: automatic switch on failure

    On failure, retry another server. Usually used for read operations; retries add extra latency.

    For example, our MQ client configuration sets failover to roundrobin, using a round-robin scheduling algorithm for fault tolerance.

  ☆ failfast: fail fast

    Make only one attempt and report the error immediately on failure. Usually used for non-idempotent write operations; if a machine happens to be restarting, the call may fail.

    One of our databases, although upgraded to MariaDB, is still one master with multiple slaves; a failed write to the master is handled with failfast.

  ☆ failsafe: fail safe

    On exception, simply ignore it. Usually used for operations such as writing logs.

  ☆ failback: automatic recovery on failure

    Record failed requests in the background and resend them periodically. Usually used for message notifications. Unreliable: the queue is lost on restart.

  ☆ forking: call multiple servers in parallel

    Return as soon as any one succeeds. Usually used for read operations with high real-time requirements; wastes more server resources.

  ☆ broadcast: broadcast call

    Call every provider one by one; if any one reports an error, the whole call errors. Usually used to update each provider's local state. Slow.
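The first strategy, failover with round-robin rotation, is easy to sketch. This is a toy illustration, not our MQ client's real API: the server names and the `Function`-based call interface are made up for the example.

```java
import java.util.*;
import java.util.function.Function;

public class FailoverInvoker {
    // failover + roundrobin: on failure, retry the next server in rotation,
    // up to `retries` extra attempts.
    private final List<String> servers;
    private int next = 0;

    FailoverInvoker(List<String> servers) { this.servers = servers; }

    <R> R invoke(Function<String, R> call, int retries) {
        RuntimeException last = null;
        for (int i = 0; i <= retries; i++) {
            String server = servers.get(next);
            next = (next + 1) % servers.size();   // round-robin rotation
            try {
                return call.apply(server);
            } catch (RuntimeException e) {
                last = e;                          // failover: try the next server
            }
        }
        throw last;                                // all attempts exhausted
    }

    public static void main(String[] args) {
        FailoverInvoker invoker = new FailoverInvoker(List.of("mq-1", "mq-2"));
        // mq-1 is "down"; failover rotates to mq-2 and succeeds.
        String result = invoker.invoke(server -> {
            if (server.equals("mq-1")) throw new RuntimeException("connection refused");
            return "ok from " + server;
        }, 1);
        System.out.println(result); // ok from mq-2
    }
}
```

Note the trade-off stated above: this is fine for idempotent reads, but with `retries = 0` the same loop degenerates into failfast, which is what you want for non-idempotent writes.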

  

  Readers of "Java Concurrency in Practice" will readily connect these fault-tolerance mechanisms to Java's fail-fast and fail-safe behavior. Chatting with the post-90s engineer on Friday, we happened to discuss the collection classes. One question: in AbstractList's iterator, the set operation does expectedModCount = modCount. Since set does not change the length, why is this needed at all, especially when the subclasses that implement set do not do it? My thought is that some set implementations might be realized indirectly through add and remove. In any case, every implementation class derived from AbstractList checks that expectedModCount and modCount agree, and throws ConcurrentModificationException the moment they differ; that is fail-fast. Something like CopyOnWriteArrayList, by contrast, performs writes on a copy of the underlying collection and never throws ConcurrentModificationException; that is fail-safe.
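The contrast between the two behaviors can be shown in a few lines: modifying an ArrayList while iterating trips the modCount check, while the same pattern on a CopyOnWriteArrayList iterates a snapshot and never throws.

```java
import java.util.*;
import java.util.concurrent.CopyOnWriteArrayList;

public class FailFastVsFailSafe {
    // Fail-fast: ArrayList's iterator compares expectedModCount with modCount on
    // every next() and throws ConcurrentModificationException when the list was
    // structurally modified outside the iterator.
    static boolean failFastThrows() {
        List<String> list = new ArrayList<>(List.of("a", "b", "c"));
        try {
            for (String s : list) {
                list.remove(s);        // structural modification mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Fail-safe: CopyOnWriteArrayList performs each write on a fresh copy of the
    // backing array; the iterator keeps walking its original snapshot.
    static int copyOnWriteSizeAfterRemovingAll() {
        List<String> list = new CopyOnWriteArrayList<>(List.of("a", "b", "c"));
        for (String s : list) {
            list.remove(s);            // no exception; iterator sees the snapshot
        }
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println("fail-fast threw: " + failFastThrows());                 // true
        System.out.println("fail-safe size: " + copyOnWriteSizeAfterRemovingAll()); // 0
    }
}
```

The snapshot semantics are also why CopyOnWriteArrayList's iterator cannot see writes made after it was created, which is the price of being fail-safe.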
