ElasticSearch

Written in Java.

Elasticsearch is a near-real-time search platform: there is a slight delay, typically around one second, between indexing a document and that document becoming searchable.

Elasticsearch 2.3.2 is based on Lucene 5.5.

ElasticSearch + Kibana + Logstash(fluentd)

  • Relational DB -> Databases -> Tables -> Rows -> Columns
  • Elasticsearch -> Indices -> Types -> Documents -> Fields

An Elasticsearch cluster can contain multiple indices (databases); each index can contain multiple types (tables), each type holds multiple documents (rows), and each document contains multiple fields (columns).
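A minimal sketch of the analogy above, using a hypothetical nginx-log document (the index, type, and field names are made up for illustration):

```python
# A hypothetical document as Elasticsearch addresses it: an index (like a
# database), a type (like a table), an id (like a row key), and fields
# inside _source (like columns).
document = {
    "_index": "nginx-logs",   # database
    "_type": "access",        # table
    "_id": "1",               # row key
    "_source": {              # the row's columns
        "status": 200,
        "path": "/index.html",
    },
}

# Fields of the document correspond to columns of a row.
fields = list(document["_source"].keys())
print(fields)  # ['status', 'path']
```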

Related Terminology

  1. Segmentation: word segmentation, i.e. splitting text into terms; a Token Filter is a filter applied to individual terms.
  2. Term: an exact value that is indexed.
  3. Tokenization: to create an inverted index, the value of each document field is first split into individual words (also called terms or tokens), which are then assembled into a sorted list of unique terms. This process is called tokenization.
  4. Stopword: a very common word that is filtered out during analysis.
  5. Stemming: reducing a word to its stem or root form.
  6. Faceted Search: filtering and navigating search results by category.
  7. First Principles: fundamental principles.
  8. Fuzzy matching: approximate (inexact) string matching.
  9. Search As You Type: search suggestions are offered instantly while the user is still typing the query.
  10. Did-you-mean suggestion: spelling correction for search queries.
  11. Elasticsearch Index: a place to store related data. In reality, an index is just a logical namespace that points to one or more physical shards.

    An index is a collection of documents with somewhat similar characteristics, e.g. an nginx-log index or a syslog index. An index is identified by a name, which must be all lowercase; that name is used when indexing, searching, updating, and deleting its documents. An index corresponds to a database in a relational database.

  12. Shard: the smallest-level "worker unit"; it holds only a portion of all the data in the index. Our documents are stored in shards, and it is within shards that they are indexed.

Technically, a shard is a directory of files where Lucene stores the data for your index. A shard is also the smallest unit that Elasticsearch moves from node to node.
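Since each document lives in exactly one shard, Elasticsearch has to decide which one; the documented rule is shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document id. A minimal sketch (real Elasticsearch uses Murmur3; `zlib.crc32` here is just a deterministic stand-in):

```python
import zlib

def pick_shard(doc_id, num_primary_shards=5):
    """shard = hash(routing) % number_of_primary_shards."""
    # Real Elasticsearch hashes the routing value with Murmur3;
    # crc32 is a deterministic stand-in for illustration only.
    return zlib.crc32(doc_id.encode()) % num_primary_shards

# Every node computes the same answer, so any node can route a request
# for a given document id to the correct shard.
shard = pick_shard("log-0001")
print(0 <= shard < 5)  # True
```

This is also why the number of primary shards cannot be changed after index creation: changing it would change the result of the modulo for existing documents.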

  1. Node: a node is a single server that is part of the cluster; it stores data and participates in the cluster's indexing and search capabilities. Like a cluster, a node is identified by a name, by default a random string of characters assigned at startup; you can of course define your own. The name matters: it is how you match servers to nodes in the cluster.
A node can join a cluster by specifying the cluster name. By default, every node is configured to join a cluster named elasticsearch, so if you start several nodes that can discover each other, they will automatically form a cluster named elasticsearch.
  2. Type: within an index, one or more types can be defined. Whether a type is a logical category or a partition is entirely up to you. Usually, a type is defined for documents that share a common set of fields. For example, all the data generated over time by ttlsa operations might be stored in a single index named logstash-ttlsa, with separate types defined for user data, post data, and comment data. A type corresponds to a table in a relational database.
  3. Document: a document is the basic unit of information that can be indexed. Documents are expressed in JSON.

    A type can store as many documents as needed.

    Although a document physically lives in one index, in practice a document must be indexed into an index and assigned a type.

    A document corresponds to a row in a relational database.

  4. Aggregations: allow you to generate sophisticated statistics over your data. Much like GROUP BY in SQL, but more powerful.

  5. By default, each index is made up of five primary shards, each with one replica, for a total of ten shards.

  6. Analysis is the process of parsing text to transform it and break it down into elements so that searches are relevant.

    Analysis has four stages:

    • Character filtering: character filters are used to transform particular character sequences into other character sequences.
    • Breaking into tokens: Lucene itself doesn't act on large strings of data; instead, it acts on what are known as tokens.
    • Token filtering: token filters take a token as input and can modify, add, or remove tokens as needed.
    • Token indexing: the resulting tokens are stored in the inverted index.
  7. Elasticsearch field data: all of the unique values for a field.

  8. Cache churn: frequent eviction and rebuilding of cache entries (memory thrashing).

  9. Segment: each segment is a miniature Apache Lucene index in its own right, and segments can differ in size.

  10. FST (Finite State Transducer): used to implement the completion suggester.

  11. By default, an Elasticsearch node is all of the following types: master-eligible, data, ingest, and machine learning (if available).

    1. Master node: controls the cluster. The master node is responsible for lightweight cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. It is important for cluster health to have a stable master node.
    2. Data nodes hold data and perform data-related operations such as CRUD, search, and aggregations.
    3. Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich the document before indexing.
    4. Machine learning node: a node that has xpack.ml.enabled and node.ml set to true, which is the default behavior in the Elasticsearch default distribution.
    5. Coordinating node: every node is implicitly a coordinating node. Coordinating-only nodes behave as smart load balancers.

  12. Index Lifecycle Management (ILM) provides users with many of the most common index management features as a matter of policy.

  13. Example policy: we wish to roll over the index after it reaches a size of 50 GB, or 30 days after it was created, and then delete the index after 90 days.
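The four analysis stages listed above can be sketched as a tiny pipeline. The character filter, tokenizer, and token filter below are simplified stand-ins for Lucene's configurable components, not real Elasticsearch analyzers:

```python
import re

STOPWORDS = {"the", "a", "an"}

def char_filter(text):
    # Stage 1: transform character sequences (here: strip HTML-like tags).
    return re.sub(r"<[^>]*>", " ", text)

def tokenizer(text):
    # Stage 2: break the large string into individual tokens.
    return [t for t in re.split(r"\W+", text) if t]

def token_filters(tokens):
    # Stage 3: modify or remove tokens (lowercase + stopword removal).
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    # Stage 4 (token indexing) would store the resulting tokens
    # in the inverted index.
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>The</b> Quick Brown Fox"))  # ['quick', 'brown', 'fox']
```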

Elasticsearch Index Lifecycle Management

  1. ILM state: GET _ilm/status
  2. Get all ILM Policies: GET _ilm/policy/
  3. Get Specific ILM Policy: GET _ilm/policy/logstash_policy/
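The rollover/delete policy described in the terminology section (roll over at 50 GB or 30 days, delete after 90 days) could be expressed roughly as follows; the policy name reuses the `logstash_policy` shown above, and exact field names may vary between Elasticsearch versions:

```
PUT _ilm/policy/logstash_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50GB", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```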

References

  1. Elasticsearch: The Definitive Guide (Chinese edition)
  2. Getting started with Elasticsearch, the open-source distributed search engine (part 2)
  3. Performance monitoring and log management for containers
  4. Lucene Query Syntax
  5. grokdebug
  6. logstash grok filter for logs with arbitrary attribute-value pairs
  7. Stepwise evolution of a distributed ELK log-center architecture (tuicool)
  8. Java REST Client Logging Configuration
  9. Important Elasticsearch configuration
  10. Explain Lifecycle API
  11. [ElasticStack] ES index lifecycle management
