A tricky Python word-frequency problem, solved three ways in Python


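Given any English document, count how many times each word occurs.

Analysis:

The problem is not very hard. Words are usually separated by spaces, but some words are followed by special symbols (punctuation and the like), so we only need to replace those symbols before splitting.

Code 1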

import re

file_name = 'code.txt'

lines_count = 0
words_count = 0
chars_count = 0
words_dict = {}

with open(file_name, 'r') as f:
    for line in f:
        lines_count = lines_count + 1
        chars_count = chars_count + len(line)
        # re.findall returns every substring that matches the pattern, as a list;
        # here: every run of characters that is neither a letter nor a digit
        match = re.findall(r'[^a-zA-Z0-9]+', line)
        for i in match:
            # keep only the English words: replace everything else with a space
            line = line.replace(i, ' ')
        lines_list = line.split()
        words_count = words_count + len(lines_list)
        for i in lines_list:
            if i not in words_dict:
                words_dict[i] = 1
            else:
                words_dict[i] = words_dict[i] + 1

print('words_count is', words_count)
print('distinct words:', len(words_dict))
print('lines_count is', lines_count)
print('chars_count is', chars_count)

for k, v in words_dict.items():
    print(k, v)

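This code, which I found online, is a bit long-winded. The idea is this: use a regular expression to find every run of characters that is neither a letter nor a digit and save those runs, then go back over the line and replace each of them with a space.

If I may say so: why not just replace them directly and be done with it?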
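The direct replacement suggested above could look roughly like the following minimal sketch. It assumes that a single re.sub call is an acceptable stand-in for the findall-then-replace loop; the helper name count_words and the sample sentence are hypothetical and exist only for illustration.

```python
import re

def count_words(text):
    """Count words after turning every non-alphanumeric run into a space."""
    # One re.sub call does the work of the findall + str.replace loop.
    cleaned = re.sub(r'[^a-zA-Z0-9]+', ' ', text)
    counts = {}
    for word in cleaned.split():
        # dict.get supplies 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts

sample = "Hello, world! Hello again."  # made-up sample text
result = count_words(sample)
print(result)
```

Note that the count is case-sensitive, just like Code 1: "Hello" and "hello" would be counted as different words unless the text is lowercased first.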

Code 2:

I wrote this one myself. It is somewhat more concise than Code 1.

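import re

with open("code.txt", 'r') as f:
    s = f.read()

# str.replace does not understand regular expressions and returns a new
# string, so use re.sub to turn every run of non-letters into one space
s = re.sub(r"[^a-zA-Z]+", ' ', s)
s = s.split()

word = {}
for i in s:
    if i not in word:
        word[i] = 1
    else:
        word[i] = word[i] + 1

for k, v in word.items():
    print(k, v)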

Code 3:

Think your code is concise enough? Not quite: Python already ships a ready-made tool for this.


import collections
import re

def calwords(path):
    word = []
    with open(path) as file:
        data = file.readlines()
    for line in data:
        # split on spaces and commas (ASCII or full-width), dropping empty pieces
        word += [w for w in re.split(r'[ ,，]+', line.strip('\n')) if w]
    print(collections.Counter(word))

if __name__ == '__main__':
    calwords('e://code.txt')

Notes on the methods used

re.findall, basic usage: returns every substring of string that matches pattern, as a list.

Syntax: findall(pattern, string, flags=0)
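As a small illustration of re.findall, the sketch below runs two patterns over a made-up sample line: the "not a letter or digit" pattern used in Code 1, and its inverse, which picks out the words directly. The sample string is an assumption for demonstration only.

```python
import re

line = "Hello, world! It's 2 pm."  # made-up sample line

# Every run of characters that is NOT a letter or a digit.
separators = re.findall(r'[^a-zA-Z0-9]+', line)
print(separators)

# Every run of letters or digits, i.e. the words themselves.
words = re.findall(r'[a-zA-Z0-9]+', line)
print(words)
```

The second pattern shows that findall alone can extract the words, although it also splits "It's" into "It" and "s".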

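The string replace method: str.replace(old, new) returns a new string in which each occurrence of the first argument is replaced by the second. The first argument is matched literally, not as a regular expression, and the original string is left unchanged.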
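A short sketch of how str.replace behaves, using made-up strings. It highlights two things that are easy to get wrong: the result has to be assigned, and the first argument is taken literally rather than as a regular expression.

```python
text = "a-b-c"  # made-up sample string

# replace returns a NEW string; the original is not modified.
replaced = text.replace("-", " ")
print(replaced)
print(text)

# The first argument is a literal substring, not a regex, so a
# character-class pattern matches nothing and the string comes back as-is.
unchanged = "a1b2".replace("[^a-z]", " ")
print(unchanged)
```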

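The str.split and re.split methods

str.split(): for a single separator, str.split() is enough. It does not support regular expressions or multiple separators. Called with no arguments, it splits on runs of whitespace, so the number of consecutive spaces does not matter.

re.split(): for multiple separators or more complex splitting, use re.split.

Prototype: re.split(pattern, string, maxsplit=0)

It splits the string using a regular expression. If the pattern is wrapped in capturing parentheses, the matched separators are also included in the returned list. maxsplit is the maximum number of splits: maxsplit=1 splits once; the default, 0, means no limit.

Examples:

>>> import re
>>> a = 'w w w'

1. Split on whitespace
>>> re.split(r'[\s]', a)
['w', 'w', 'w']

2. Split only once
>>> re.split(r'[\s]', a, maxsplit=1)
['w', 'w w']

3. Split on any of several characters
>>> c = 'w!w@w%w^w'
>>> re.split(r'[!@%^]', c)
['w', 'w', 'w', 'w', 'w']

4. Non-capturing group (?:...)
>>> re.split(r'(?:!@%^)', c)
['w!w@w%w^w']

In example 4 the characters are no longer inside a character class, so the pattern looks for the sequence !@% followed by ^, which outside a class is the start-of-string anchor. That never matches, and the string comes back unsplit.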
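To make the differences concrete, here is a small sketch with made-up strings. It contrasts str.split with and without an argument, and shows what capturing parentheses change in re.split.

```python
import re

spaced = 'w  w w'  # note the two spaces after the first w

# No argument: runs of whitespace count as one separator.
no_arg = spaced.split()
# Explicit single space: each space splits, leaving an empty string.
one_space = spaced.split(' ')
print(no_arg, one_space)

c = 'w!w@w%w^w'
# Character class without a group: separators are dropped.
without_group = re.split(r'[!@%^]', c)
# Capturing group around the class: separators are kept in the result.
with_group = re.split(r'([!@%^])', c)
print(without_group)
print(with_group)
```

Keeping the separators is rarely wanted for word counting, which is why the word-counting code above uses patterns without capturing groups.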

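The strip method

Python's strip() method removes the specified characters from the beginning and end of a string. By default it removes whitespace, including spaces and newlines. The argument is treated as a set of characters to remove, not as a fixed prefix or suffix.

Note: this method only removes characters at the start or end of the string; it cannot remove characters in the middle.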
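A brief sketch of strip on made-up strings, showing the default behavior, a custom character, and the fact that characters in the middle are left alone.

```python
line = "  hello, world!\n"  # made-up sample line

# Default: strip leading and trailing whitespace, including the newline.
default_strip = line.strip()
print(repr(default_strip))

# With an argument: only the listed characters are stripped, and only
# from the two ends; the "xx" in the middle survives.
custom_strip = "xxhello xx worldxx".strip("x")
print(repr(custom_strip))

# The argument is a set of characters, not a fixed prefix or suffix.
set_strip = "abcHellocba".strip("abc")
print(repr(set_strip))
```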

Counter (collections.Counter) is a container that counts how many times each element occurs in a list or other iterable.
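A minimal sketch of Counter on a made-up word list. Besides counting, it shows most_common, which returns the elements sorted by frequency, and the fact that a missing key yields 0 instead of raising an error.

```python
from collections import Counter

words = ["the", "cat", "the", "dog", "the", "cat"]  # made-up sample data

counts = Counter(words)
print(counts)

# The two most frequent words, as (word, count) pairs.
top_two = counts.most_common(2)
print(top_two)

# Looking up a word that never appeared returns 0, not a KeyError.
missing = counts["bird"]
print(missing)
```

Because Counter is a dict subclass, the usual dict operations such as items() work on it as well.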
