English text: word frequency analysis of Hamlet

Counting English word frequencies takes two steps:

1. Denoise and normalize the text
2. Use a dictionary to record word frequencies

Code:
# CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()                        # normalize to lower case
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")           # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Output:
the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436
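As an aside, the counting and ranking above can also be done in one step with collections.Counter from the standard library. A minimal sketch, reusing the getText() function defined above:

from collections import Counter

counts = Counter(getText().split())           # word -> frequency mapping
for word, count in counts.most_common(10):    # ten most frequent words
    print("{0:<10}{1:>5}".format(word, count))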
Chinese text: analyzing the characters of Romance of the Three Kingdoms (《三国演义》) by word frequency

Counting Chinese word frequencies takes two steps:
1. Segment the Chinese text into words
2. Use a dictionary to record word frequencies

Code:

# CalThreeKingdomsV1.py
import jieba

txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)                      # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:                       # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Output:
曹操        953
孔明        836
将军        772
却说        656
玄德        585
关公        510
丞相        491
二人        469
不可        440
荆州        425
玄德曰       390
孔明曰       390
不能        384
如此        378
张飞        358
The results clearly contain irrelevant or duplicate entries: 将军 and 却说 are not character names, while 孔明/孔明曰 and 玄德/玄德曰 each refer to a single person.
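Before optimizing, it helps to see what jieba actually produces. jieba.lcut() returns a plain list of tokens that mixes multi-character words with single characters (particles, punctuation), which is why the code above skips tokens of length 1. A quick illustration (the sample sentence is my own, and the exact segmentation can vary with the jieba version and dictionary):

import jieba

# lcut returns a list; jieba.cut returns a generator over the same tokens
tokens = jieba.lcut("中文文本需要先分词,然后才能统计词频")
print(tokens)   # multi-character words mixed with single-character tokens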
Optimized version: the improved Chinese word frequency count takes three steps:

1. Segment the Chinese text into words
2. Use a dictionary to record word frequencies
3. Extend the program to fix the problems above

We put the irrelevant or duplicate entries into an excludes set and remove them, and merge each character's aliases under a single canonical name.
# CalThreeKingdomsV2.py
import jieba

excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"                       # merge aliases of Zhuge Liang
    elif word == "关公" or word == "云长":
        rword = "关羽"                       # merge aliases of Guan Yu
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"                       # merge aliases of Liu Bei
    elif word == "孟德" or word == "丞相":
        rword = "曹操"                       # merge aliases of Cao Cao
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
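One caveat with the deletion loop above: del counts[word] raises a KeyError if a word in excludes never occurred in the text (possible with a different edition of the novel). A slightly more defensive variant, as a sketch:

# remove excluded words if present; silently skip missing keys
for word in excludes:
    counts.pop(word, None)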
Word frequencies in postgraduate entrance exam (考研) English: applying the same word frequency technique to past 考研 English texts, we can extract the key words that appear most often.

Text download: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA password: fw3r
# CalHamletV1.py (reused for the exam texts)
def getText():
    txt = open("86_17_1_2.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")           # replace special characters with spaces
    return txt

pyTxt = getText()      # text with all punctuation replaced by spaces
words = pyTxt.split()  # list of words
counts = {}            # dictionary of word -> count pairs
excludes = {"the","a","of","to","and","in","b","c","d","is",
            "was","are","have","were","had","that","for","it",
            "on","be","as","with","by","not","their","they",
            "from","more","but","or","you","at","has","we","an",
            "this","can","which","will","your","one","he","his","all","people","should","than","points","there","i","what","about","new","if","”",
            "its","been","part","so","who","would","answer","some","our","may","most","do","when","1","text","section","2","many","time","into",
            "10","no","other","up","following","【答案】","only","out","each","much","them","such","world","these","sheet","life","how","because","3","even",
            "work","directions","use","could","now","first","make","years","way","20","those","over","also","best","two","well","15","us","write","4","5",
            "being","social","read","like","according","just","take","paragraph","any","english","good","after","own","year","must","american","less","her",
            "between","then","children","before","very","human","long","while","often","my","too",
            "40","four","research","author","questions","still","last","business","education","need","information","public","says","passage","reading",
            "through","women","she","health","example","help","get","different","him","mark","might","off","job","30","writing","choose","words","economic",
            "become","science","society","without","made","high","students","few","better","since","6","rather","however","great","where","culture","come",
            "both","three","same","government","old","find","number","means","study","put","8","change","does","today","think","future","school","yet",
            "man","things","far","line","7","13","50","used","states","down","12","14","16","end","11","making","9","another","young","system","important",
            "letter","17","chinese","every","see","s","test","word","century","language","little",
            "give","said","25","state","problems","sentence","food","translation","given","child","18","longer","question","back","don’t","19","against",
            "always","answers","know","having","among","instead","comprehension","large","35","want","likely","keep","family","go","why","41","home","law",
            "place","look","day","men","22","26","45","it’s","others","companies","countries","once","money","24","though",
            "27","29","31","say","national","ii","23","based","found","28","32","past","living","university","scientific","–","36","38","working","around",
            "data","right","21","jobs","33","34","possible","feel","process","effect","growth","probably","seems","fact","below","37","39","history",
            "technology","never","sentences","47","true","scientists","power","thought","during","48","early","parents",
            "something","market","times","46","certain","whether","000","did","enough","problem","least","federal","age","idea","learn","common",
            "political","pay","view","going","attention","happiness","moral","show","live","until","52","49","ago","percent","stress","43","44","42",
            "meaning","51","e","iii","u","60","anything","53","55","cultural","nothing","short","100","water","car","56","58","【解析】","54","59","57",
            "v","。","63","64","65","61","62","66","70","75","f","【考点分析】","67","here","68","71","72","69","73","74","选项a","ourselves","teachers",
            "helps","参考范文","gdp","yourself","gone","150"}
for word in words:
    if word not in excludes:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
x = len(counts)
print(x)               # number of distinct words kept
# page through the ranked words 100 at a time, printing them
# quoted and comma-separated so they can be pasted into excludes
r = 0
next = eval(input("1继续"))                  # enter 1 to continue
while next == 1:
    for i in range(r, r + 100):
        word, count = items[i]
        print("\"{}\"".format(word), end=",")
    r += 100
    next = eval(input("1继续"))