Elasticsearch series: the GIL in Python, ES document query operations, using the ik analyzer, two ways to integrate ES in Python, cluster setup (split brain), and the scoring mechanism

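Contents

Yesterday's review
The GIL in Python
I. Document query operations
    1 match and term queries
    2 Sorted queries
    3 Paginated queries
    4 Boolean queries
    5 Filtering query results
    6 Highlighted queries
    7 Aggregations
II. Using the ik analyzer
III. Two ways to integrate ES in Python
    1 Native client
    2 DSL client
IV. Cluster setup (split brain)
V. Scoring mechanism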

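Yesterday's review

#1  Installed ES and the JDK environment
#2  Installed Kibana (official; configure which ES instance it connects to); Postman also works
#3  Installed elasticsearch-head (third party; it makes cross-origin requests, so CORS must be enabled in the ES config): npm install, then npm run start

#4  Index create/delete/read/update: number of shards and number of replicas (the replica count can be changed later)
#5  Mapping management: like creating a table, with fields and field types (strings are keyword or text); keyword is indexed as-is without analysis, text is analyzed into terms and then indexed

#6  Inverted index: an article is first split into terms, and an index is built from each term to the documents that contain it; a forward index, by contrast, is keyed by the document (for example its title)

#7  Document create/delete/read/update (updates come in two kinds: full overwrite and partial update)

#8  Queries: query-string search (rarely used) and structured queries
	-match
	-match_all
	-match_phrase: phrase query
	-match_phrase with slop: how many positions apart the terms may be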

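The GIL in Python

GIL: the global interpreter lock. It belongs to the CPython interpreter; other interpreters differ (Jython has no GIL, while PyPy has a GIL of its own).
If CPython has this limitation, why do we still use it so widely?
	-The vast majority of third-party and built-in modules are written against CPython

In CPython a thread must acquire the GIL before it can run (the GIL is essentially one big mutex that turns work that could have run in parallel into serial execution).
A thread is the smallest unit of CPU scheduling. Suppose one process starts 3 threads:
within that process only one thread executes at any given moment, so threads cannot take advantage of multiple cores.
Start as many threads as there are CPU cores: because of the GIL only one thread runs at a time, so the CPU will never reach 100% across all cores.
Start as many processes as there are CPU cores: the GIL only locks the threads inside its own interpreter process, so the threads of different processes are scheduled on different cores and the CPU can be fully used.

The GIL described here is specific to the CPython interpreter.
CPU-bound work: use multiple processes.
IO-bound work (little CPU use): use multiple threads.

In Python 2 the GIL is released on IO or after a fixed number of bytecode instructions have executed.
In Python 3 (3.2 and later) the GIL is released on IO or when the time slice expires (the switch interval, 5 ms by default).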
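
The thread-versus-process comparison described above can be sketched as follows, assuming CPython with the GIL enabled. The names `count_down` and `timed` and the workload size are illustrative choices, not part of the original notes. On a multi-core machine the process pool usually finishes this CPU-bound work noticeably faster than the thread pool, but exact timings depend on the machine.

```python
import sys
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_down(n):
    """CPU-bound work: a pure-Python loop that needs the GIL for every step."""
    while n > 0:
        n -= 1
    return n


def timed(executor_cls, jobs, n):
    """Run `jobs` copies of count_down(n) on the executor; return elapsed seconds."""
    start = time.perf_counter()
    with executor_cls(max_workers=jobs) as pool:
        results = list(pool.map(count_down, [n] * jobs))
    assert results == [0] * jobs
    return time.perf_counter() - start


if __name__ == "__main__":
    # Python 3 hands the GIL over on a time slice (the switch interval).
    print("switch interval:", sys.getswitchinterval())
    # Threads share one GIL, so the four jobs effectively run one after another.
    print("threads  :", round(timed(ThreadPoolExecutor, 4, 2_000_000), 2), "s")
    # Each process has its own interpreter and its own GIL.
    print("processes:", round(timed(ProcessPoolExecutor, 4, 2_000_000), 2), "s")
```

For IO-bound work the picture is reversed: threads are cheaper to start than processes, and the GIL is released while a thread waits on IO.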

I. Document query operations

1 match and term queries

# AND and OR conditions
# AND: both clauses must match
GET t3/doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "beautiful"
          }
        },
        {
          "match": {
            "desc": "beautiful"
          }
        }
      ]
    }
  }
}

# OR: at least one clause must match
GET t3/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "beautiful"
          }
        },
        {
          "match": {
            "desc": "beautiful"
          }
        }
      ]
    }
  }
}



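# Differences between match, term and terms
# match: the query text is analyzed (split into terms) before searching
GET w10/_doc/_search
{
  "query": {
    "match": {
      "t1": "Beautiful girl!"
    }
  }
}

# term: the query value is not analyzed; it must equal an indexed term exactly
GET w10/_doc/_search
{
  "query": {
    "term": {
      "t1": "girl"
    }
  }
}

# terms: because term is not analyzed, use terms to look up several exact values at once
GET w10/_doc/_search
{
  "query": {
    "terms": {
      "t1": ["beautiful", "sexy"]
    }
  }
}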
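
The difference between match, term and terms comes down to whether the query text is analyzed. The sketch below mimics that in plain Python. The `analyze` function is a toy stand-in (lowercase, split on non-word characters) and only an assumption about how a text field might be analyzed; it is not Elasticsearch's actual analyzer, and all function names are illustrative.

```python
import re


def analyze(text):
    """Toy analyzer: lowercase the text and split it on non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def match(indexed_terms, query_text):
    """match: analyze the query text, then hit if any resulting term is indexed."""
    return any(tok in indexed_terms for tok in analyze(query_text))


def term(indexed_terms, value):
    """term: compare the raw value with the indexed terms, no analysis."""
    return value in indexed_terms


def terms(indexed_terms, values):
    """terms: like term, but hit if any of several raw values is indexed."""
    return any(v in indexed_terms for v in values)


# What a text field would store for the document "Beautiful girl!"
indexed = set(analyze("Beautiful girl!"))

print(match(indexed, "Beautiful girl!"))      # query text is analyzed first
print(term(indexed, "girl"))                  # exact indexed term
print(term(indexed, "Beautiful girl!"))       # raw string is not an indexed term
print(terms(indexed, ["beautiful", "sexy"]))  # one of the values is indexed
```

This is why a term query for the original mixed-case phrase finds nothing on an analyzed text field, while the same phrase works with match.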
        
        
        
# pymysql: native operations, results come back as dicts
# ORM: results are mapped directly to objects

2 Sorted queries

##### Not every field can be sorted on: numeric (and date) fields can, but analyzed text fields cannot by default

GET lqz/_doc/_search
{
  "query": {
    "match": {
      "from": "gu"
    }
  }
}

# Descending order
GET lqz/_doc/_search
{
  "query": {
    "match": {
      "from": "gu"
    }
  },
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    }
  ]
}

## Ascending order
GET lqz/_doc/_search
{
  "query": {
    "match": {
      "from": "gu"
    }
  },
  "sort": [
    {
      "age": {
        "order": "asc"
      }
    }
  ]
}


GET lqz/_doc/_search
{
  "query": {
    "match_all": {
    }
  },
  "sort": [
    {
      "age": {
        "order": "asc"
      }
    }
  ]
}

3 Paginated queries

# Baseline: every document sorted by age in descending order, no pagination yet
GET lqz/_doc/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    }
  ]
}

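# from is a zero-based offset and size is the page size: skip the first 2 hits and return the next 2
GET lqz/_doc/_search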
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    }
  ], 
  "from": 2,
  "size": 2
}




### Note: in Elasticsearch every top-level clause of the request body is pluggable; separate them with `,`
GET lqz/_doc/_search
{
  "query": {
    "match_all": {}
  }, 
  "from": 2,
  "size": 2
}
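
Applications usually think in page numbers rather than offsets. A minimal sketch of the conversion, assuming 1-based page numbers; the helper names `page_to_from_size` and `paged_body` are hypothetical and not part of any Elasticsearch client.

```python
def page_to_from_size(page, per_page):
    """Translate a 1-based page number into Elasticsearch's from/size pair."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be at least 1")
    return {"from": (page - 1) * per_page, "size": per_page}


def paged_body(page, per_page):
    """Build a match_all request body sorted by age, limited to one page."""
    body = {
        "query": {"match_all": {}},
        "sort": [{"age": {"order": "desc"}}],
    }
    body.update(page_to_from_size(page, per_page))
    return body


# Page 2 with 2 hits per page gives from=2, size=2, the same values as the request above.
print(paged_body(2, 2))
```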

4 Boolean queries

- must (and)
- should (or)
- must_not (not)

## bool with must: multiple clauses are combined with AND
GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "from": "gu"
          }
        },
        {
          "match": {
            "name": "顾老二"
          }
        }
      ]
    }
  }
}


## bool with should: multiple clauses are combined with OR
GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "from": "gu"
          }
        },
        {
          "match": {
            "name": "龙套偏房"
          }
        }
      ]
    }
  }
}





### must_not: matches only documents for which none of the clauses match
GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "from": "gu"
          }
        },
        {
          "match": {
            "tags": "可爱"
          }
        },
        {
          "match": {
            "age": 18
          }
        }
      ]
    }
  }
}




### filter with comparison conditions: gt, lt, gte, lte
GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "from": "gu"
          }
        }
      ],
      "filter": {
        "range": {
          "age": {
            "lt": 30
          }
        }
      }
    }
  }
}


### Range query
GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "from": "gu"
          }
        }
      ],
      "filter": {
        "range": {
          "age": {
            "gte": 25,
            "lte": 30
          }
        }
      }
    }
  }
}
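
When such requests are sent from Python, the body is just a nested dict. The sketch below builds the must-plus-range-filter body shown above; `bool_range_query` is a hypothetical helper written for illustration, not a function from any client library.

```python
import json


def bool_range_query(match_field, match_text, range_field, **bounds):
    """Build a bool query: one match clause in must, one range clause in filter."""
    allowed = {"gt", "gte", "lt", "lte"}
    unknown = set(bounds) - allowed
    if unknown:
        raise ValueError(f"unsupported range operators: {sorted(unknown)}")
    return {
        "query": {
            "bool": {
                "must": [{"match": {match_field: match_text}}],
                "filter": {"range": {range_field: bounds}},
            }
        }
    }


# Same request as above: from matches "gu" AND 25 <= age <= 30
body = bool_range_query("from", "gu", "age", gte=25, lte=30)
print(json.dumps(body, indent=2))
```

Keeping the range in filter rather than must means it restricts the result set without contributing to the relevance score.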


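### filter must sit inside bool. To AND it with a query, put the query in must. If you put the query in should instead, the should clauses become optional whenever filter (or must) is present, so documents that only pass the filter are returned as well

must: AND relation, equivalent to and in a relational database.
should: OR relation, equivalent to or in a relational database.
must_not: NOT relation, equivalent to not in a relational database.
filter: filter condition.
range: range condition.
gt: greater than, equivalent to > in a relational database.
gte: greater than or equal to, equivalent to >= in a relational database.
lt: less than, equivalent to < in a relational database.
lte: less than or equal to, equivalent to <= in a relational database.

5 Filtering query results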

### Basic usage
GET lqz/_doc/_search
{
  "query": {
    "match_all": {}
  },
  "_source": ["name", "age"]
}


#### _source sits at the same level as query

GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "must":{
        "match":{"from":"gu"}
      },
      
      "filter": {
        "range": {
          "age": {
            "lte": 25
          }
        }
      }
    }
  },
  "_source":["name","age"]
}

6 Highlighted queries

GET lqz/_doc/_search
{
  "query": {
    "match": {
      "from": "gu"
    }
  },
  "highlight": {
    "pre_tags": "",
    "post_tags": "",
    "fields": {
      "from": {}
    }
  }
}
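
What pre_tags and post_tags do can be mimicked in a few lines of Python: every occurrence of the matched term is wrapped in the two tags. This is only a toy illustration with a hypothetical `highlight` function; Elasticsearch's highlighter works on analyzed terms, not on a regular expression. The `<em>` tags are used here because they are what Elasticsearch falls back to when no tags are given.

```python
import re


def highlight(text, term, pre_tag="<em>", post_tag="</em>"):
    """Wrap each whole-word, case-insensitive occurrence of term in the tags."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre_tag}{m.group(0)}{post_tag}", text)


print(highlight("gu, from the gu family", "gu"))
# <em>gu</em>, from the <em>gu</em> family
```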

7 Aggregations

# sum, avg, max, min

# Average age, like: select avg(age) as my_avg from table where from = 'gu';
GET lqz/_doc/_search
{
  "query": {
    "match": { "from": "gu" }
  },
  "aggs": {
    "my_avg": {
      "avg": { "field": "age" }
    }
  },
  "_source": ["name", "age"]
}

# Maximum age
GET lqz/_doc/_search
{
  "query": {
    "match": { "from": "gu" }
  },
  "aggs": {
    "my_max": {
      "max": { "field": "age" }
    }
  },
  "_source": ["name", "age"]
}

# Minimum age
GET lqz/_doc/_search
{
  "query": {
    "match": { "from": "gu" }
  },
  "aggs": {
    "my_min": {
      "min": { "field": "age" }
    }
  },
  "_source": ["name", "age"]
}

# Sum of ages
GET lqz/_doc/_search
{
  "query": {
    "match": { "from": "gu" }
  },
  "aggs": {
    "my_sum": {
      "sum": { "field": "age" }
    }
  },
  "_source": ["name", "age"]
}

# Grouping
# Now I want to group everyone by age range, `15~20, 20~25, 25~30`, and compute the average age of each group.
# "size": 0 is optional; without it the matching documents are returned as well, which is unnecessary here
GET lqz/_doc/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "age_group": {
      "range": {
        "field": "age",
        "ranges": [
          { "from": 15, "to": 20 },
          { "from": 20, "to": 25 },
          { "from": 25, "to": 30 }
        ]
      },
      "aggs": {
        "my_avg": {
          "avg": { "field": "age" }
        }
      }
    }
  }
}
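
To make the semantics of these aggregations concrete, the sketch below computes the same figures over a small in-memory list. The `people` data and the function names are made up for illustration. One detail worth noting: in a range aggregation each bucket includes its `from` value and excludes its `to` value, so a person aged 25 lands in 25~30, not in 20~25.

```python
from statistics import mean

# Hypothetical documents standing in for the lqz index
people = [
    {"name": "a", "age": 18},
    {"name": "b", "age": 22},
    {"name": "c", "age": 25},
    {"name": "d", "age": 29},
]


def metric_aggs(docs, field):
    """avg / max / min / sum over one numeric field."""
    values = [d[field] for d in docs]
    return {"avg": mean(values), "max": max(values), "min": min(values), "sum": sum(values)}


def range_avg(docs, field, ranges):
    """Bucket docs into [from, to) ranges and average the field per bucket."""
    buckets = {}
    for lo, hi in ranges:
        values = [d[field] for d in docs if lo <= d[field] < hi]
        buckets[(lo, hi)] = {
            "doc_count": len(values),
            "avg": mean(values) if values else None,
        }
    return buckets


print(metric_aggs(people, "age"))
print(range_avg(people, "age", [(15, 20), (20, 25), (25, 30)]))
```
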
II. Using the ik analyzer

#1 Download the release that matches your ES version from GitHub
https://github.com/medcl/elasticsearch-analysis-ik/releases?after=v7.5.2
#2 Unzip it into the ES plugins directory
#3 Restart ES

# What is the difference between ik_max_word and ik_smart?
ik_max_word: splits the text at the finest granularity. For example “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination; suited to Term Query.
ik_smart: splits the text at the coarsest granularity. For example “中华人民共和国国歌” is split into “中华人民共和国,国歌”; suited to Phrase queries.

PUT books
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "price": { "type": "integer" },
      "addr": { "type": "keyword" },
      "company": {
        "properties": {
          "name": { "type": "text" },
          "company_addr": { "type": "text" },
          "employee_count": { "type": "integer" }
        }
      },
      "publish_date": { "type": "date", "format": "yyyy-MM-dd" }
    }
  }
}

PUT books/_doc/1
{
  "title": "大头儿子小偷爸爸",
  "price": 100,
  "addr": "北京天安门",
  "company": {
    "name": "我爱北京天安门",
    "company_addr": "我的家在东北松花江傻姑娘",
    "employee_count": 10
  },
  "publish_date": "2019-08-19"
}

PUT books/_doc/2
{
  "title": "白雪公主和十个小矮人",
  "price": "99",
  "addr": "黑暗森里",
  "company": {
    "name": "我的家乡在上海",
    "company_addr": "朋友一生一起走",
    "employee_count": 10
  },
  "publish_date": "2018-05-19"
}

GET books/_mapping

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "白雪公主和十个小矮人"
}

GET books/_search
{
  "query": {
    "match": {
      "title": "十"
    }
  }
}

PUT books2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "price": { "type": "integer" },
      "addr": { "type": "keyword" },
      "company": {
        "properties": {
          "name": { "type": "text" },
          "company_addr": { "type": "text" },
          "employee_count": { "type": "integer" }
        }
      },
      "publish_date": { "type": "date", "format": "yyyy-MM-dd" }
    }
  }
}

PUT books2/_doc/1
{
  "title": "大头儿子小偷爸爸",
  "price": 100,
  "addr": "北京天安门",
  "company": {
    "name": "我爱北京天安门",
    "company_addr": "我的家在东北松花江傻姑娘",
    "employee_count": 10
  },
  "publish_date": "2019-08-19"
}

PUT books2/_doc/2
{
  "title": "白雪公主和十个小矮人",
  "price": "99",
  "addr": "黑暗森里",
  "company": {
    "name": "我的家乡在上海",
    "company_addr": "朋友一生一起走",
    "employee_count": 10
  },
  "publish_date": "2018-05-19"
}

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "白雪公主和十个小矮人"
}

GET books2/_search
{
  "query": {
    "match": {
      "title": "十个"
    }
  }
}
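
The contrast between finest-grained and coarsest-grained splitting can be illustrated with a toy dictionary segmenter. This is not the ik plugin's algorithm, only an assumption-laden sketch: `finest_split` lists every dictionary word found anywhere in the text, and `coarsest_split` greedily takes the longest dictionary word at each position. The mini dictionary is hypothetical.

```python
# Hypothetical mini dictionary; the real ik plugin ships a far larger one.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def finest_split(text, dictionary):
    """Exhaustive lookup: every substring that is a dictionary word."""
    return [
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
        if text[i:j] in dictionary
    ]


def coarsest_split(text, dictionary):
    """Greedy longest match; falls back to a single character if nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


text = "中华人民共和国国歌"
print(finest_split(text, DICTIONARY))    # many overlapping tokens
print(coarsest_split(text, DICTIONARY))  # ['中华人民共和国', '国歌']
```

More tokens at index time means more ways for a query to hit the document, which is why a search for “十” can match a title indexed with ik_max_word but may miss one indexed with ik_smart.
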
III. Two ways to integrate ES in Python

1 Native client

# Official low-level client for Elasticsearch
### Comparable to pymysql
# pip3 install elasticsearch
# The doc_type argument used below belongs to the 7.x client API

from elasticsearch import Elasticsearch

obj = Elasticsearch()  # client object; the 7.x client connects to localhost:9200 by default

# Create an index
# result = obj.indices.create(index='user', ignore=400)
# print(result)

# Delete an index
# result = obj.indices.delete(index='user', ignore=[400, 404])

# Insert a document
# data = {'userid': '1', 'username': 'lqz', 'password': '123'}
# result = obj.create(index='news', doc_type='_doc', id=1, body=data)
# print(result)

# Update a document
'''
Without the 'doc' wrapper the request fails with
ActionRequestValidationException[Validation Failed: 1: script or doc is missing
'''
# data = {'doc': {'userid': '1', 'username': 'lqz', 'password': '123ee', 'test': 'test'}}
# result = obj.update(index='news', doc_type='_doc', body=data, id=1)
# print(result)

# Delete a document
# result = obj.delete(index='news', doc_type='_doc', id=1)
# print(result)

# Search
# Find all documents
# query = {'query': {'match_all': {}}}
# Find all documents whose title matches '十个'
query = {'query': {'match': {'title': '十个'}}}
# Find all documents with age greater than 11
# query = {'query': {'range': {'age': {'gt': 11}}}}
allDoc = obj.search(index='books', doc_type='_doc', body=query)
# print(allDoc)
print(allDoc['hits']['hits'][0]['_source'])
2 DSL client

# Elasticsearch DSL is a high-level library built on top of the low-level client (comparable to an ORM)
# pip3 install elasticsearch-dsl

from elasticsearch_dsl import Document, Text
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=["localhost"])


class Article(Document):
    title = Text(analyzer='ik_max_word')
    author = Text()

    class Index:
        name = 'myindex'

    def save(self, **kwargs):
        return super(Article, self).save(**kwargs)


if __name__ == '__main__':
    # Article.init()  # create the index and its mapping

    # Save a document
    # article = Article()
    # article.title = "测试测试阿斯顿发送到发斯蒂芬啊啊士大夫阿斯蒂芬"
    # article.author = "lqz"
    # article.save()  # the document is now stored

    # Query documents
    # s = Article.search()
    # s = s.filter('match', title="测试")
    #
    # results = s.execute()  # run the search
    # print(results[0].title)

    # Delete documents (delete-by-query on everything the search matches)
    s = Article.search()
    s = s.filter('match', title="李清照").delete()

    # Update a document
    # s = Article.search()
    # s = s.filter('match', title="测试")
    # results = s.execute()
    # print(results[0])
    # results[0].title = "李清照阿斯顿发送到发送阿斯蒂"
    # results[0].save()
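IV. Cluster setup (split brain)

# 1 Broadcast (multicast) discovery (generally not used)
	-Any ES node that is reachable by ping automatically joins the cluster
# 2 Unicast discovery
# Note: the discovery.zen.* settings below apply to ES 6.x and earlier; ES 7 uses discovery.seed_hosts and cluster.initial_master_nodes instead

# 1 Node elasticsearch1: cluster name my_es1, transport port 9300; node name node1, HTTP on local port 9200; eligible to become master and to hold data (these are the defaults when nothing is specified).
cluster.name: my_es1
node.name: node1
network.host: 127.0.0.1
http.port: 9200
transport.tcp.port: 9300
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9300", "127.0.0.1:9302", "127.0.0.1:9303", "127.0.0.1:9304"]

# 2 Node elasticsearch2: cluster name my_es1, transport port 9302; node name node2, HTTP on local port 9202; eligible to become master and to hold data.
cluster.name: my_es1
node.name: node2
network.host: 127.0.0.1
http.port: 9202
transport.tcp.port: 9302
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9300", "127.0.0.1:9302", "127.0.0.1:9303", "127.0.0.1:9304"]

# 3 Node elasticsearch3: cluster name my_es1, transport port 9303; node name node3, HTTP on local port 9203; eligible to become master and to hold data.
cluster.name: my_es1
node.name: node3
network.host: 127.0.0.1
http.port: 9203
transport.tcp.port: 9303
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9300", "127.0.0.1:9302", "127.0.0.1:9303", "127.0.0.1:9304"]

# 4 Node elasticsearch4: cluster name my_es1, transport port 9304; node name node4, HTTP on local port 9204; holds data only and cannot be elected master.
cluster.name: my_es1
node.name: node4
network.host: 127.0.0.1
http.port: 9204
transport.tcp.port: 9304
node.master: false
node.data: true
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9300", "127.0.0.1:9302", "127.0.0.1:9303", "127.0.0.1:9304"]

As the configuration above shows, all nodes share the cluster name my_es1, but because they run on the same machine each node needs its own node name and ports. We start them one by one; they introduce themselves to each other through the unicast host list, discover one another and form the my_es1 cluster. Which node becomes master depends on which one starts first.

# 3 Suppose there are 7 nodes
	-Because of a network problem they split into a group of 3 and a group of 4, and each group forms its own cluster (split brain)
	-To prevent split brain, set the minimum number of master-eligible nodes required to elect a master to (number of master-eligible nodes / 2 + 1):
discovery.zen.minimum_master_nodes: 4   # 4 = 7/2 + 1 (integer division)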
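
The quorum rule above is simple arithmetic and can be checked in a few lines. A minimal sketch, assuming all 7 nodes are master-eligible; the function names are illustrative.

```python
def minimum_master_nodes(master_eligible):
    """Quorum: more than half of the master-eligible nodes."""
    return master_eligible // 2 + 1


def can_elect_master(partition_size, master_eligible):
    """A network partition may elect a master only if it holds a quorum."""
    return partition_size >= minimum_master_nodes(master_eligible)


total = 7
print("quorum:", minimum_master_nodes(total))
for size in (3, 4):
    print(f"partition of {size} can elect a master:", can_elect_master(size, total))
```

Because two disjoint partitions cannot both hold more than half of the nodes, at most one side can elect a master, which is exactly what prevents split brain.
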
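V. Scoring mechanism

1 The process of determining how relevant a document is to a query is called scoring
2 TF is term frequency: the number of times a term occurs in a document; the more often it occurs, the more relevant the document
3 IDF is inverse document frequency: the more documents in the index a term occurs in, the less important that term is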
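
The two ideas can be sketched with the textbook TF-IDF formula, where the score is tf × log(N / df). This is a simplification for illustration only and not the formula Elasticsearch actually uses (recent versions default to BM25, which builds on the same two ideas). The corpus and function names are made up.

```python
import math

# Hypothetical corpus: document id -> already-tokenized text
corpus = {
    "d1": "beautiful girl beautiful day",
    "d2": "beautiful city",
    "d3": "sexy girl",
}


def tf(term, text):
    """Term frequency: how many times the term occurs in one document."""
    return text.split().count(term)


def idf(term, docs):
    """Inverse document frequency: log(N / df); 0.0 if the term is absent."""
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / df) if df else 0.0


def score(term, doc_id, docs):
    """Textbook TF-IDF score of one term for one document."""
    return tf(term, docs[doc_id]) * idf(term, docs)


print(idf("beautiful", corpus), idf("sexy", corpus))  # the rarer term weighs more
print(score("beautiful", "d1", corpus), score("beautiful", "d2", corpus))
```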

Feel free to share; when reposting, please credit the source: 内存溢出 (outofmemory.cn)
Original article: http://outofmemory.cn/zaji/5704965.html
