Elasticsearch Search Engine Storage (Basic Usage)

1. Elasticsearch Concepts

Elasticsearch has a few basic concepts, such as nodes, indices, and documents. They are described below; understanding them makes Elasticsearch easier to work with.

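Nodes and clusters
Elasticsearch is essentially a distributed database that lets multiple servers work together, and each server can run multiple Elasticsearch instances.
A single Elasticsearch instance is called a node, and a group of nodes forms a cluster.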

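Index
Elasticsearch indexes every field and, after processing, writes the data into an inverted index, which is what it consults when searching. The top-level unit of data management in Elasticsearch is therefore called an index; it is roughly equivalent to a database in MySQL or MongoDB. Note that the name of every index (that is, database) must be lowercase.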
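To make the idea of an inverted index concrete, here is a toy sketch in plain Python. It is not how Elasticsearch (or Lucene) is implemented; the function names and the whitespace tokenizer are assumptions chosen for illustration only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenizer; real analyzers do much more.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, term):
    """Return the sorted ids of the documents containing the term."""
    return sorted(index.get(term.lower(), set()))


docs = {
    1: "Elasticsearch is a search engine",
    2: "MySQL is a relational database",
    3: "Search with an inverted index",
}
inverted = build_inverted_index(docs)
print(lookup(inverted, "search"))  # [1, 3]
print(lookup(inverted, "is"))      # [1, 2]
```

A lookup goes straight from a term to the matching documents without scanning every document, which is why searching the index is fast.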

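Document
A single record in an index is called a document, and many documents make up an index.
Documents in the same index are not required to share the same structure (schema), but keeping the structure consistent is recommended because it improves efficiency.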

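Type
Documents can be grouped. In a weather index, for example, documents can be grouped by city (Beijing and Shanghai) or by climate (sunny and rainy). Such a group is called a type. It is a virtual, logical grouping used to filter documents, similar to a table in MySQL or a collection in MongoDB.
Documents of different types should have similar structures. For example, an id field cannot be a string in one group and a number in another. This differs from tables in a relational database. Data of completely different natures should be stored in two separate indices rather than as two types in one index. Note that types are deprecated in Elasticsearch 7.x, where each index holds a single type, _doc, as the '_type': '_doc' field in the responses later in this article shows.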

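Field
Each document is similar to a JSON object that contains multiple fields, each with its own value; together the fields make up the document. A field is comparable to a column in a MySQL table.
In Elasticsearch, a document belongs to a type, and types live inside an index. The correspondence with a traditional relational database can be sketched as follows:
Relational DB --> Databases --> Tables --> Rows --> Columns
Elasticsearch --> Indices --> Types --> Documents --> Fields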

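2. Preparation

First make sure Elasticsearch is installed. For installation instructions, see:

https://cuiqingcai.com/31085.html
(from Cui Qingcai's blog)

Once it is installed, confirm that it is running on port 9200.
Requesting port 9200 on localhost should return something like: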

{
  "name" : "DESKTOP-3BGU5UH",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "rh-hJvleSkysw4eED99kUw",
  "version" : {
    "number" : "7.17.0",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "bee86328705acaa9a6daede7140defd4d9ec56bd",
    "build_date" : "2022-01-28T08:36:04.875279988Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

The Python library for working with Elasticsearch is elasticsearch, installed with:

pip3 install elasticsearch

3. Creating an Index

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.indices.create(index="news", ignore=400)
print(result)

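Here we first create an Elasticsearch object without passing any arguments. By default it connects to the Elasticsearch service on local port 9200. We can also specify the connection details explicitly, for example:

es = Elasticsearch(["https://[username:password@]hostname:port"], verify_certs=True)  # verify_certs: whether to verify the SSL certificate

Running the index creation code produces: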

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'news'}

The result is a JSON-style dictionary, and its acknowledged field indicates that the create operation succeeded.
Running the code a second time produces:

{'error': {'root_cause': [{'type': 'resource_already_exists_exception', 'reason': 'index [news/TNEjlIt8SmyuI0veOvQCXg] already exists', 'index_uuid': 'TNEjlIt8SmyuI0veOvQCXg', 'index': 'news'}], 'type': 'resource_already_exists_exception', 'reason': 'index [news/TNEjlIt8SmyuI0veOvQCXg] already exists', 'index_uuid': 'TNEjlIt8SmyuI0veOvQCXg', 'index': 'news'}, 'status': 400}

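Creation failed. The status code is 400, and the error type shows that the index already exists.
Note that the code passes ignore=400, which means that a 400 response is ignored and no exception is raised.
Without it, the call raises an exception: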

  File "D:python3网络爬虫venvlibsite-packageselasticsearchtransport.py", line 466, in perform_request
    raise e
  File "D:python3网络爬虫venvlibsite-packageselasticsearchtransport.py", line 434, in perform_request
    timeout=timeout,
  File "D:python3网络爬虫venvlibsite-packageselasticsearchconnectionhttp_urllib3.py", line 291, in perform_request
    self._raise_error(response.status, raw_data)
  File "D:python3网络爬虫venvlibsite-packageselasticsearchconnectionbase.py", line 329, in _raise_error
    status_code, error_message, additional_info
elasticsearch.exceptions.RequestError: RequestError(400, 'resource_already_exists_exception', 'index [news/TNEjlIt8SmyuI0veOvQCXg] already exists')

So make good use of the ignore parameter to handle expected error cases without interrupting the program.
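An alternative to ignoring the 400 is to check for the index before creating it. The following is a minimal sketch, assuming es is a 7.x elasticsearch.Elasticsearch client like the one used in this article; the helper name ensure_index is made up for this example.

```python
def ensure_index(es, name):
    """Create the index only if it does not exist yet.

    `es` is assumed to be an elasticsearch.Elasticsearch client (7.x).
    Returns True if the index was created, False if it already existed.
    """
    if es.indices.exists(index=name):
        return False
    es.indices.create(index=name)
    return True
```

It would be called as ensure_index(es, "news"). Another client could still create the index between the check and the create call, so passing ignore=400 remains the simpler choice when that matters.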

4. Deleting an Index

from elasticsearch import Elasticsearch

es = Elasticsearch(verify_certs=True)  # whether to verify the SSL certificate
result = es.indices.delete(index="news", ignore=[400, 404])
print(result)

ignore works the same way as above.
The result is:

{'acknowledged': True}

Deleting it again returns:

{'error': {'root_cause': [{'type': 'index_not_found_exception', 'reason': 'no such index [news]', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}], 'type': 'index_not_found_exception', 'reason': 'no such index [news]', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}, 'status': 404}
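Because ignore turns errors into ordinary return values, the caller has to inspect the response to know what happened. Here is a small sketch of how that could be done; the two helper names are hypothetical and the sample dictionaries are trimmed versions of the responses shown above.

```python
def index_op_succeeded(result):
    """Return True if a create/delete index response reports success."""
    if "error" in result:
        return False
    return bool(result.get("acknowledged", False))


def error_type(result):
    """Return the error type of an ignored error response, or None."""
    error = result.get("error")
    if isinstance(error, dict):
        return error.get("type")
    return None


ok = {"acknowledged": True}
missing = {
    "error": {"type": "index_not_found_exception", "reason": "no such index [news]"},
    "status": 404,
}
print(index_op_succeeded(ok), error_type(ok))            # True None
print(index_op_succeeded(missing), error_type(missing))  # False index_not_found_exception
```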

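5. Inserting Data

Like MongoDB, Elasticsearch accepts structured dictionary data directly on insert. Data can be inserted with the create method. Here we insert one news item as an example.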

from elasticsearch import Elasticsearch

es = Elasticsearch(verify_certs=True)  # whether to verify the SSL certificate
es.indices.create(index="news", ignore=[400, 404])
data = {
    "title": "乘风破浪不负韶华，奋斗青春圆梦高考",
    "url": "https://view.inews.qq.com/a/EDU2021041600732200"
}
result = es.create(index="news", id=1, body=data)
print(result)

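Three arguments are passed: index is the index name, id is the unique identifier of the document, and body is the document content. The result is: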

{'_index': 'news', '_type': '_doc', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}

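Data can also be inserted with the index method. create requires an id to uniquely identify the document, whereas index does not; if the id is omitted, one is generated automatically.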

result = es.index(index="news", body=data)

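create behaves like index with the operation type set to create, so it fails with a 409 conflict if a document with the same id already exists. In older versions of the Python client, create was literally a wrapper around index.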
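The difference between the two methods matters when the same id may be inserted twice. The sketch below assumes a 7.x client, where ignore=409 makes create return the conflict response instead of raising; the helper name create_if_absent is an invention for this example.

```python
def create_if_absent(es, index, doc_id, doc):
    """Insert a document only if no document with this id exists.

    `es` is assumed to be an elasticsearch.Elasticsearch client (7.x).
    Returns True if the document was created, False on a 409 conflict.
    """
    result = es.create(index=index, id=doc_id, body=doc, ignore=409)
    # A successful create reports 'result': 'created'; a conflict
    # response carries an 'error' key and no 'result' key.
    return result.get("result") == "created"
```

Calling create_if_absent(es, "news", 1, data) twice would insert the document the first time and leave it untouched the second time.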

6. Updating Data

Updating data is just as simple: specify the id and the new content, then call update. The code is as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()

data = {
    'title': '乘风破浪不负韶华，奋斗青春圆梦高考',
    'url': 'http://view.inews.qq.com/a/EDU2021041600732200',
    'date': '2021-07-05'
}
result = es.update(index='news', body=data, id=1)
print(result)

Updating this way raises an error, probably because the example in Cui Qingcai's book was written for an older version:

elasticsearch.exceptions.RequestError: RequestError(400, 'x_content_parse_exception', '[1:2] [UpdateRequest] unknown field [title]')

The fields to update must be wrapped in a doc key, for example:

from elasticsearch import Elasticsearch

es = Elasticsearch()

data = {
    "doc":{
        'title': '乘风破浪不负韶华，奋斗青春圆梦高考',
        'url': 'http://view.inews.qq.com/a/EDU2021041600732200',
        'date': '2021-07-09'
    }

}
result = es.update(index='news', body=data, id=1)
print(result)

Result:

{'_index': 'news', '_type': '_doc', '_id': '1', '_version': 2, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 8, '_primary_term': 1}
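Since forgetting the doc wrapper is an easy mistake, the body can be built by a small helper. This is only a sketch; build_update_body is a hypothetical name, while doc and doc_as_upsert are fields of the Elasticsearch update API.

```python
def build_update_body(fields, upsert=False):
    """Wrap the changed fields in the `doc` key expected by the update API."""
    body = {"doc": dict(fields)}
    if upsert:
        # doc_as_upsert asks ES to insert `doc` if the id does not exist yet.
        body["doc_as_upsert"] = True
    return body


print(build_update_body({"date": "2021-07-09"}))
# {'doc': {'date': '2021-07-09'}}
print(build_update_body({"date": "2021-07-09"}, upsert=True))
# {'doc': {'date': '2021-07-09'}, 'doc_as_upsert': True}
```

With the client from this article it could be used as es.update(index='news', id=1, body=build_update_body({'date': '2021-07-09'})). Only the fields listed in doc are changed; the other fields of the document are kept.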

7. Deleting Data

from elasticsearch import Elasticsearch

es = Elasticsearch()

result = es.delete(index='news', id=1)
print(result)

The output is:

{'_index': 'news', '_type': '_doc', '_id': '1', '_version': 3, 'result': 'deleted', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 11, '_primary_term': 1}

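The deletion succeeded. _version is now 3: the version changes with every write, so the first was the creation, the second the update, and the third the deletion.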

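8. Querying Data

The operations above are all basic and simple. The real strength of Elasticsearch lies in its search capabilities.
For Chinese text we need to install an analysis (tokenizer) plugin; here we use elasticsearch-analysis-ik. We install it with elasticsearch-plugin, another command-line tool that ships with Elasticsearch. The version installed here is 7.13.2; the plugin version must match the version of your Elasticsearch installation, so adjust the version in the URL if yours differs. The command is: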

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.13.2/elasticsearch-analysis-ik-7.13.2.zip

After installation, restart Elasticsearch.
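Because a version mismatch is a common reason for the plugin installation to fail, it can help to compare versions before downloading. The sketch below only parses the release file name from the URL above; the helper names are hypothetical, and the Elasticsearch version is assumed to be read separately, for example from es.info()["version"]["number"].

```python
import re


def plugin_version_from_url(url):
    """Extract the version from an IK release file name, or None."""
    match = re.search(r"elasticsearch-analysis-ik-(\d+\.\d+\.\d+)\.zip", url)
    return match.group(1) if match else None


def plugin_matches(es_version, plugin_url):
    """Return True if the plugin build targets exactly this ES version."""
    return plugin_version_from_url(plugin_url) == es_version


url = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/"
    "download/v7.13.2/elasticsearch-analysis-ik-7.13.2.zip"
)
print(plugin_matches("7.13.2", url))  # True
print(plugin_matches("7.17.0", url))  # False
```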

First we recreate the index and specify the field to be analyzed. The code is as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()
mapping = {
    'properties': {
        'title': {
            'type': 'text',
            'analyzer': 'ik_max_word',
            'search_analyzer': 'ik_max_word'
        }
    }
}
es.indices.delete(index='news', ignore=[400, 404])
es.indices.create(index='news', ignore=400)
result = es.indices.put_mapping(index='news', body=mapping)
print(result)

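Here we first delete the previous index, create a new one, and then update its mapping. The mapping specifies the field to be analyzed, including its type, its analyzer, and its search analyzer (search_analyzer). Setting both to ik_max_word uses the Chinese analysis plugin we just installed; if no analyzer is specified, the default standard analyzer is used, which splits Chinese text into single characters.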
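The mapping can also be supplied when the index is created, which avoids the separate put_mapping call. The following sketch only builds the request body; the function name build_news_index_body is hypothetical, and the analyzer names assume the IK plugin is installed.

```python
def build_news_index_body(analyzer="ik_max_word", search_analyzer="ik_max_word"):
    """Build a create-index body that declares the title mapping up front."""
    return {
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": analyzer,
                    "search_analyzer": search_analyzer,
                }
            }
        }
    }


body = build_news_index_body()
print(body["mappings"]["properties"]["title"]["analyzer"])  # ik_max_word
```

With a 7.x client this body could be passed as es.indices.create(index='news', body=build_news_index_body()). The plugin also provides a coarser analyzer, ik_smart, which could be passed as search_analyzer if fewer, longer query terms are preferred.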
Next, insert a few documents.

from elasticsearch import Elasticsearch

es = Elasticsearch()

datas = [
    {
        'title': '高考结局大不同',
        'url': 'https://k.sina.com.cn/article_7571064628_1c3454734001011lz9.html',
    },
    {
        'title': '进入职业大洗牌时代，“吃香”职业还吃香吗？',
        'url': 'https://new.qq.com/omn/20210828/20210828A025LK00.html',
    },
    {
        'title': '乘风破浪不负韶华，奋斗青春圆梦高考',
        'url': 'http://view.inews.qq.com/a/EDU2021041600732200',
    },
    {
        'title': '他，活出了我们理想的样子',
        'url': 'https://new.qq.com/omn/20210821/20210821A020ID00.html',
    }
]

for data in datas:
    es.index(index='news', body=data)

Here we define four documents, each with a title and a url field, and insert them into the news index with the index method.
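Calling index once per document sends one request per document. For larger batches, the client's bulk helper is usually a better fit. The sketch below only prepares the actions; to_bulk_actions is a hypothetical name and the example.com URLs are placeholders.

```python
def to_bulk_actions(index, docs):
    """Yield one bulk action per document, letting ES generate the ids."""
    for doc in docs:
        yield {"_index": index, "_source": doc}


sample = [
    {"title": "first headline", "url": "https://example.com/1"},
    {"title": "second headline", "url": "https://example.com/2"},
]
actions = list(to_bulk_actions("news", sample))
print(len(actions))          # 2
print(actions[0]["_index"])  # news
```

The actions could then be sent with bulk from elasticsearch.helpers, for example bulk(es, to_bulk_actions('news', datas)).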
Next, let's query the data. Calling search without a query body returns all documents in the index:

result = es.search(index="news")
print(result)

The result is:

{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 4, 'relation': 'eq'}, 'max_score': 1.0, 'hits': [{'_index': 'news', '_type': '_doc', '_id': 'PsLjuH4BJ6CKg0kw5yYC', '_score': 1.0, '_source': {'title': '高考结局大不同', 'url': 'https://k.sina.com.cn/article_7571064628_1c3454734001011lz9.html'}}, {'_index': 'news', '_type': '_doc', '_id': 'P8LjuH4BJ6CKg0kw5yZ8', '_score': 1.0, '_source': {'title': '进入职业大洗牌时代，“吃香”职业还吃香吗？', 'url': 'https://new.qq.com/omn/20210828/20210828A025LK00.html'}}, {'_index': 'news', '_type': '_doc', '_id': 'QMLjuH4BJ6CKg0kw5yaE', '_score': 1.0, '_source': {'title': '乘风破浪不负韶华，奋斗青春圆梦高考', 'url': 'http://view.inews.qq.com/a/EDU2021041600732200'}}, {'_index': 'news', '_type': '_doc', '_id': 'QcLjuH4BJ6CKg0kw5yaL', '_score': 1.0, '_source': {'title': '他，活出了我们理想的样子', 'url': 'https://new.qq.com/omn/20210821/20210821A020ID00.html'}}]}}

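The query returned the four inserted documents. They appear in the hits field, where total gives the number of matching results and max_score is the highest relevance score.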
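The raw response is deeply nested, so in practice it is common to pull out just the parts that are needed. Here is a minimal sketch that follows the 7.x response layout shown above; summarize_hits is a hypothetical helper and the sample response is shortened.

```python
def summarize_hits(result):
    """Return (total, [(score, title), ...]) from a search response."""
    hits = result["hits"]
    total = hits["total"]["value"]
    rows = [(hit["_score"], hit["_source"]["title"]) for hit in hits["hits"]]
    return total, rows


sample = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {"_id": "a", "_score": 1.0, "_source": {"title": "first"}},
            {"_id": "b", "_score": 0.5, "_source": {"title": "second"}},
        ],
    }
}
total, rows = summarize_hits(sample)
print(total)  # 2
print(rows)   # [(1.0, 'first'), (0.5, 'second')]
```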
We can also run a full-text search, which is where Elasticsearch really shows its nature as a search engine:

from elasticsearch import Elasticsearch

dsl = {
    'query': {
        'match': {
            'title': '高考 圆梦'
        }
    }
}

es = Elasticsearch()
result = es.search(index='news', body=dsl)
print(result)

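Here we query using the DSL that Elasticsearch supports. match specifies a full-text search, the field searched is title, and the search text is 高考 圆梦. The search result is: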

{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 2, 'relation': 'eq'}, 'max_score': 1.7796917, 'hits': [{'_index': 'news', '_type': '_doc', '_id': 'PMLiuH4BJ6CKg0kwGSbH', '_score': 1.7796917, '_source': {'title': '乘风破浪不负韶华，奋斗青春圆梦高考', 'url': 'http://view.inews.qq.com/a/EDU2021041600732200'}}, {'_index': 'news', '_type': '_doc', '_id': 'OsLiuH4BJ6CKg0kwGSYl', '_score': 0.81085134, '_source': {'title': '高考结局大不同', 'url': 'https://k.sina.com.cn/article_7571064628_1c3454734001011lz9.html'}}]}}

The result can also be pretty-printed as JSON:

from elasticsearch import Elasticsearch
import json
es = Elasticsearch()


dsl = {
    'query': {
        'match': {
            'title': '高考 圆梦'
        }
    }
}

result = es.search(index='news', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))

The result is now much easier to read:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.7796917,
    "hits": [
      {
        "_index": "news",
        "_type": "_doc",
        "_id": "TMLluH4BJ6CKg0kw9SZ3",
        "_score": 1.7796917,
        "_source": {
          "title": "乘风破浪不负韶华，奋斗青春圆梦高考",
          "url": "http://view.inews.qq.com/a/EDU2021041600732200"
        }
      },
      {
        "_index": "news",
        "_type": "_doc",
        "_id": "SsLluH4BJ6CKg0kw9Cbi",
        "_score": 0.81085134,
        "_source": {
          "title": "高考结局大不同",
          "url": "https://k.sina.com.cn/article_7571064628_1c3454734001011lz9.html"
        }
      }
    ]
  }
}
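In the result above, the second hit contains only 高考 and not 圆梦. That is because a match query combines the analyzed terms with OR by default. The sketch below builds a query body where the operator can be switched to AND; build_match_query is a hypothetical helper, while operator and size are standard parts of the query DSL.

```python
def build_match_query(field, text, require_all=False, size=10):
    """Build a match query body; require_all switches the operator to AND."""
    operator = "and" if require_all else "or"
    return {
        "query": {"match": {field: {"query": text, "operator": operator}}},
        "size": size,
    }


dsl = build_match_query("title", "高考 圆梦", require_all=True)
print(dsl["query"]["match"]["title"]["operator"])  # and
```

Passing this body to es.search(index='news', body=dsl) should return only documents whose title contains every term produced by the analyzer.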

Original article: http://outofmemory.cn/zaji/5720153.html (source: 内存溢出)
