Deep Dive into Elasticsearch - Bulk Document Operations

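Contents

- 1. Batch retrieval (mget)
- 2. Bulk create/update/delete
  - 2.1 Deleting documents
  - 2.2 Force-creating documents
  - 2.3 Indexing documents
  - 2.4 Fully replacing documents
  - 2.5 Partially updating documents
  - 2.6 Don't repeat the index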

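1. Batch retrieval (mget)

Elasticsearch is already fast, but it can be faster still. Combining multiple requests into one avoids paying the network latency and overhead of each request separately. If you need to retrieve many documents from Elasticsearch, use the multi-get (mget) API to put all of those retrievals in a single request; this fetches the full set faster than requesting the documents one by one.

Fetching 100 documents one at a time means 100 network round trips, which adds up to significant overhead. Fetching them as a batch takes a single round trip, so the number of network requests drops from 100 to 1.

# Fetching documents one at a time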
GET /test_index/_doc/1
GET /test_index/_doc/2

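The mget API takes a docs array as its parameter. Each element carries the metadata of a document to retrieve: _index, _type, and _id. To retrieve only one or more specific fields, name them in the _source parameter.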
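To make the shape of the docs array concrete, here is a minimal Python sketch that builds an mget request body, optionally restricting each entry to a list of _source fields. The function name build_mget_body and its parameters are hypothetical helpers for illustration; the sketch only produces the JSON body and does not send it to a cluster.

```python
import json


def build_mget_body(index, ids, source_fields=None):
    """Build the JSON body for GET /_mget (hypothetical helper).

    Each entry in "docs" names one document by _index and _id.
    If source_fields is given, only those fields are requested.
    """
    docs = []
    for doc_id in ids:
        entry = {"_index": index, "_id": str(doc_id)}
        if source_fields is not None:
            entry["_source"] = list(source_fields)
        docs.append(entry)
    return {"docs": docs}


body = build_mget_body("test_index", [1, 2], source_fields=["test field"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET /_mget request. The _type key is left out on purpose, since newer versions deprecate it.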

1. Create three documents:

PUT /test_index/_doc/1
{
  "test field":"test1"
}

PUT /test_index/_doc/2
{
  "test field":"test2"
}

PUT /test_index/_doc/3
{
  "test field":"test3"
}

2. Retrieve them in one request:

GET /_mget
{
  "docs":[
    {
      "_index":"test_index",
      "_type":"_doc",
      "_id":"1"
    },
    {
      "_index":"test_index",
      "_type":"_doc",
      "_id":"2"
    }
  ]
}

#! Deprecation: [types removal] Specifying types in multi get requests is deprecated.
{
  "docs" : [
    {
      "_index" : "test_index",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "test field" : "test1"
      }
    },
    {
      "_index" : "test_index",
      "_type" : "_doc",
      "_id" : "2",
      "_version" : 1,
      "_seq_no" : 1,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "test field" : "test2"
      }
    }
  ]
}

3. Newer Elasticsearch versions have removed types, so specifying a type in an mget request is deprecated. The same request without it:

GET /_mget
{
  "docs":[
    {
      "_index":"test_index",
      "_id":"1"
    },
    {
      "_index":"test_index",
      "_id":"2"
    }
  ]
}

4. If all the documents live in the same index, a shorter form works:

GET /test_index/_doc/_mget
{
   "ids": [1, 2]
}

#! Deprecation: [types removal] Specifying types in multi get requests is deprecated.
{
  "docs" : [
    {
      "_index" : "test_index",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "test field" : "test1"
      }
    },
    {
      "_index" : "test_index",
      "_type" : "_doc",
      "_id" : "2",
      "_version" : 1,
      "_seq_no" : 1,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "test field" : "test2"
      }
    }
  ]
}

Again, because newer Elasticsearch versions have removed types, the type can be left out:

GET /test_index/_mget
{
   "ids": [1, 2]
}
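An mget response reports each document separately through its found flag, so a client usually has to separate hits from misses. The sketch below does that for a response already parsed into a Python dict. The helper name split_mget_response and the missing document with _id 9 are assumptions made up for this example.

```python
def split_mget_response(response):
    """Split a parsed mget response into found sources and missing ids.

    Returns (found, missing): found maps _id to _source,
    missing lists the _id values whose "found" flag is false.
    """
    found, missing = {}, []
    for doc in response["docs"]:
        if doc.get("found"):
            found[doc["_id"]] = doc.get("_source")
        else:
            missing.append(doc["_id"])
    return found, missing


# Trimmed sample response; the entry for _id 9 is hypothetical.
sample = {
    "docs": [
        {"_index": "test_index", "_id": "1", "found": True,
         "_source": {"test field": "test1"}},
        {"_index": "test_index", "_id": "9", "found": False},
    ]
}

found, missing = split_mget_response(sample)
print(found)
print(missing)
```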

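In short, mget matters. Whenever you need to fetch several documents at once, use the batch API to cut the number of network round trips. Depending on the workload, this can improve performance several-fold, sometimes by an order of magnitude.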

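2. Bulk create/update/delete

Just as mget lets us retrieve multiple documents at once, the bulk API lets us perform multiple create, index, update, or delete requests in a single step.

The bulk request body has a slightly different format from other requests: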

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...

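The action/metadata line specifies what operation to perform on which document.

The action must be one of the following:

- create: create the document only if it does not already exist.
- index: create a new document or replace an existing one.
- update: partially update a document.
- delete: delete a document.

The metadata should specify the _index, _type, and _id of the document to be indexed, created, updated, or deleted. A delete action has no request body line.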

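Note: the bulk API is strict about JSON layout. Each JSON object must sit on a single line with no line breaks inside it, consecutive JSON objects must be separated by a newline, and the last line must also end with a newline.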
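The one-object-per-line rule is easy to get wrong when building the body by hand. As a minimal sketch, the Python helper below serializes a list of operations into that newline-delimited format. The name to_bulk_body and the (action, metadata, source) tuple layout are assumptions made for this example, not part of any client library.

```python
import json


def to_bulk_body(operations):
    """Serialize (action, metadata, source) tuples into a bulk body.

    json.dumps emits each object on one line; delete takes no
    source line, while create/index/update require one.
    """
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if action == "delete":
            if source is not None:
                raise ValueError("delete takes no request body line")
        else:
            if source is None:
                raise ValueError(action + " needs a request body line")
            lines.append(json.dumps(source))
    # The body must end with a trailing newline.
    return "\n".join(lines) + "\n"


operations = [
    ("delete", {"_index": "test_index", "_id": "1"}, None),
    ("create", {"_index": "test_index", "_id": "1"}, {"test field": "test1"}),
    ("update", {"_index": "test_index", "_id": "1"},
     {"doc": {"test field": "bulk test2"}}),
]
payload = to_bulk_body(operations)
print(payload, end="")
```

The resulting string can be sent as the body of POST /_bulk. Line breaks inside string values are escaped by json.dumps, so they cannot split an object across lines.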

2.1 Deleting documents

1. Set up the data

PUT /test_index/_doc/1
{
  "test field":"test1"
}

PUT /test_index/_doc/2
{
  "test field":"test2"
}

PUT /test_index/_doc/3
{
  "test field":"test3"
}

2. Delete one document

POST /_bulk
{ "delete":{"_index":"test_index","_type":"_doc","_id":1}}

3. Delete two documents in one request, that is, two delete operations in a single step

POST /_bulk
{"delete":{"_index":"test_index","_type":"_doc","_id":1}}
{"delete":{"_index":"test_index","_type":"_doc","_id":3}}

2.2 Force-creating documents

Perform several delete and create requests in a single step.

POST /_bulk
{"delete":{"_index":"test_index","_type":"_doc","_id":1}}
{"delete":{"_index":"test_index","_type":"_doc","_id":3}}
{"create":{"_index":"test_index","_type":"_doc","_id":1}}
{"test field":"test1"}
{"create":{"_index":"test_index","_type":"_doc","_id":3}}
{"test field":"test3"}

2.3 Indexing documents

POST /_bulk
{"delete":{"_index":"test_index","_type":"_doc","_id":1}}
{"create":{"_index":"test_index","_type":"_doc","_id":1}}
{"test field":"test1"}
{"index":{"_index":"test_index","_type":"_doc","_id":2}}
{"test field":"test2"}

2.4 Fully replacing documents

POST /_bulk
{"delete":{"_index":"test_index","_type":"_doc","_id":1}}
{"create":{"_index":"test_index","_type":"_doc","_id":1}}
{"test field":"test1"}
{"index":{"_index":"test_index","_type":"_doc","_id":2}}
{"test field":"test2"}
{"index":{"_index":"test_index","_type":"_doc","_id":2}}
{"test field":"test22"}

2.5 Partially updating documents

Putting all the operations together, a complete bulk request looks like this.

POST /_bulk
{"delete":{"_index":"test_index","_type":"_doc","_id":1}}
{"create":{"_index":"test_index","_type":"_doc","_id":1}}
{"test field":"test1"}
{"index":{"_index":"test_index","_type":"_doc","_id":2}}
{"test field":"test2"}
{"index":{"_index":"test_index","_type":"_doc","_id":2}}
{"test field":"test22"}
{"update":{ "_index":"test_index","_type":"_doc","_id": "1"}}
{"doc":{"test field":"bulk test2"}}

The Elasticsearch response contains an items array that lists the result of each request, in the same order as the requests were sent.

#! Deprecation: [types removal] Specifying types in bulk requests is deprecated.
{
  "took" : 77,
  "errors" : false,
  "items" : [
    {
      "delete" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_version" : 9,
        "result" : "deleted",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 20,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "create" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_version" : 10,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 21,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_version" : 7,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 22,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "index" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_version" : 8,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 23,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "update" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_version" : 11,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 24,
        "_primary_term" : 1,
        "status" : 200
      }
    }
  ]
}

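Each sub-request is executed independently, so the failure of one does not affect the success of the others. If any sub-request fails, the top-level errors flag is set to true and the error details are reported under the corresponding request: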

POST /_bulk
{"create":{"_index":"test_index","_type":"_doc","_id":1}}
{"test field":"test1"}
{"index":{"_index":"test_index","_type":"_doc","_id":1}}
{"test field":"test2"}

{
  "took" : 7,
  "errors" : true,
  "items" : [
    {
      "create" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "status" : 409,
        "error" : {
          "type" : "version_conflict_engine_exception",
          "reason" : "[1]: version conflict, document already exists (current version [11])",
          "index_uuid" : "lokVYUtTTJG2TwWqCmyxzw",
          "shard" : "0",
          "index" : "test_index"
        }
      }
    },
    {
      "index" : {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_version" : 12,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 26,
        "_primary_term" : 1,
        "status" : 200
      }
    }
  ]
}

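In the response, create on document 1 fails because the document already exists, but the subsequent index request, also on document 1, succeeds. This also means that bulk requests are not atomic and cannot be used to implement transactions. Each request is processed separately, so the success or failure of one request does not affect the others.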
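Because a bulk request can partly succeed, client code should inspect the items array rather than rely on the HTTP status alone. The following Python sketch collects the failed sub-requests from a response already parsed into a dict. The helper name failed_items is hypothetical, and the sample is a trimmed copy of the response shown above.

```python
def failed_items(bulk_response):
    """Return (position, action, _id, status, error type) per failure."""
    failures = []
    # The top-level "errors" flag is false when nothing failed.
    if not bulk_response.get("errors"):
        return failures
    for position, item in enumerate(bulk_response["items"]):
        # Each item is a single-key dict: {action: result}.
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append((position, action, result["_id"],
                             result["status"], result["error"]["type"]))
    return failures


sample = {
    "took": 7,
    "errors": True,
    "items": [
        {"create": {"_index": "test_index", "_id": "1", "status": 409,
                    "error": {"type": "version_conflict_engine_exception"}}},
        {"index": {"_index": "test_index", "_id": "1",
                   "result": "updated", "status": 200}},
    ],
}

failures = failed_items(sample)
print(failures)
```

Since items come back in request order, the position in each tuple identifies which sub-request to retry or report.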

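2.6 Don't repeat the index

Perhaps you are bulk-indexing log data into the same index and type. Specifying the same metadata for every document is wasteful. Instead, just as with the mget API, the bulk request accepts a default /_index or /_index/_type in the URL: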

POST /test_index/_doc/_bulk
{"index":{}}
{"test field":"test1"}

Because types have been removed in newer Elasticsearch versions, the type can be omitted:

POST /test_index/_bulk
{"index":{}}
{"test field":"test2"}

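The entire bulk request has to be loaded into memory by the node that receives it, so the larger the request, the less memory is available for other requests. There is an optimal bulk request size; beyond it, performance no longer improves and may even degrade. That optimum is not a fixed number, though. It depends entirely on your hardware, the size and complexity of your documents, and the overall indexing and search load.

Fortunately, this sweet spot is easy to find: bulk-index typical documents in batches of increasing size. When performance starts to drop off, the batch is too big. A good starting point is 1,000 to 5,000 documents per batch, or fewer if your documents are very large.

It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1 KB documents is very different from one thousand 1 MB documents. A good bulk size to start with is around 5-15 MB.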
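One way to respect both limits at once is to cut batches by document count and by serialized size, whichever is reached first. The Python sketch below is an illustration under assumptions: the name chunk_documents and the defaults (1,000 documents, 5 MB) are picked from the ranges above as a starting point, and the size estimate counts only the document source lines, not the action/metadata lines.

```python
import json


def chunk_documents(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of documents capped by count and serialized size."""
    batch, batch_bytes = [], 0
    for doc in docs:
        # Size of the JSON line plus its trailing newline.
        size = len(json.dumps(doc).encode("utf-8")) + 1
        too_many = len(batch) >= max_docs
        too_big = batch_bytes + size > max_bytes
        if batch and (too_many or too_big):
            yield batch
            batch, batch_bytes = [], 0
        # A single oversized document still gets a batch of its own.
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch


docs = [{"test field": "x" * 100} for _ in range(2500)]
batches = list(chunk_documents(docs, max_docs=1000))
print([len(b) for b in batches])
```

Each yielded batch would then become one bulk request. Tuning max_docs and max_bytes while measuring throughput is how the sweet spot described above can be located.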

Original article: http://outofmemory.cn/zaji/5678082.html
