NLP: Using NLTK

Python NLP

Published: 2020-12-15 19:17


NLTK
NLTK corpora

NLTK

Installing NLTK

pip install nltk

Using it then raises an error:

Resource punkt not found. Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')

Download the resource with the code suggested in the message:

nltk.download('punkt')

The download does not go through and fails with the error: getaddrinfo failed.

Method 1

Reference: nltk_data LookupError
Download the punkt package from nltk_data, then extract it into the D:\nltk_data\tokenizers directory.
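
The manual extraction step can also be scripted with the standard library. The following is a minimal sketch, assuming the archive has already been downloaded by hand; `install_offline_package` and its parameters are hypothetical names used for illustration and are not part of NLTK.

```python
import os
import zipfile


def install_offline_package(zip_path, data_dir, category="tokenizers"):
    """Extract a manually downloaded package archive into data_dir/category.

    Hypothetical helper, not part of NLTK: it only unpacks the zip file.
    """
    target = os.path.join(data_dir, category)
    os.makedirs(target, exist_ok=True)  # create e.g. D:\nltk_data\tokenizers
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


# Example call (the zip location is a placeholder; adjust it to your machine):
# install_offline_package(r"punkt.zip", r"D:\nltk_data")
```

The function returns the directory it extracted into, so the result can be compared against the directories NLTK reports in its "Searched in:" message.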

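Method 2

Reference: Installing nltk_data offline

Open Github-nltk_data and download the second folder, "packages". A GitHub folder can be downloaded with the Chrome extension GitZip for github: right-click the blank area to the right of the folder to download it.
Contents of the packages folder

Then copy everything inside packages into any one of the following directories: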

    - 'C:\\Users\\cunzhang/nltk_data'
    - 'D:\\Anaconda\\nltk_data'
    - 'D:\\Anaconda\\share\\nltk_data'
    - 'D:\\Anaconda\\lib\\nltk_data'
    - 'C:\\Users\\cunzhang\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'

On Linux, the directories are:

Searched in:
    - '/home/hadoopcj/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/home/hadoopcj/nltk_data'
    - ''

Then go to the "D:\nltk_data\tokenizers" directory and extract punkt.zip there.
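
To confirm that the files ended up somewhere NLTK will look, the search can be approximated with plain file checks. This is a rough sketch and not NLTK's actual lookup code: `locate_resource` is a hypothetical helper. With NLTK installed, the authoritative directory list is `nltk.data.path`, and the authoritative check is `nltk.data.find('tokenizers/punkt')`, which raises `LookupError` when nothing is found.

```python
import os


def locate_resource(search_dirs, relative_path):
    """Return the first directory that holds relative_path, or None.

    A resource counts as present if it exists unpacked or as a .zip file.
    Hypothetical helper that only imitates the idea of the search.
    """
    for base in search_dirs:
        if not base:  # skip empty entries such as ''
            continue
        candidate = os.path.join(base, *relative_path.split("/"))
        if os.path.exists(candidate) or os.path.exists(candidate + ".zip"):
            return base
    return None


# Directories taken from the Windows list in this post
dirs = [r"C:\nltk_data", r"D:\nltk_data", r"E:\nltk_data"]
print(locate_resource(dirs, "tokenizers/punkt"))
```

If the call prints `None`, the package is in none of the listed directories and the copy step should be repeated.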


English tokenization with NLTK

from nltk import word_tokenize

paragraph = "When Jobs arrived back at Apple, it had a conventional structure for a company of its size and scope. It was divided into business units, each with its own P&L responsibilities."
words = word_tokenize(paragraph)
print(words)

Part-of-speech tagging

Part-of-speech tags in NLTK:

Tag	Meaning	Gloss	Examples
CC	Coordinating conjunction	conjunction	and, or, but
CD	Cardinal number	numeral	twenty-four, 1991, 14:24
DT	Determiner	determiner	the, a, some, most, every, no
EX	Existential there	existential "there"	there, there's
FW	Foreign word	foreign word	dolce, ersatz, esprit, quo, maitre
IN	Preposition or subordinating conjunction	preposition / conjunction	on, of, at, with, by, into, under
JJ	Adjective	adjective	new, good, high, special, big, local
JJR	Adjective, comparative	comparative adjective	bleaker braver breezier briefer brighter brisker
JJS	Adjective, superlative	superlative adjective	calmest cheapest choicest classiest cleanest clearest
LS	List item marker	list marker	A A. B B. C C. D E F First G H I J K
MD	Modal	modal verb	can cannot could couldn't
NN	Noun, singular or mass	noun	year, home, time, education
NNS	Noun, plural	plural noun	undergraduates scotches
NNP	Proper noun, singular	proper noun	Alison, Africa, April, Washington
NNPS	Proper noun, plural	plural proper noun	Americans Americas Amharas Amityvilles
PDT	Predeterminer	predeterminer	all both half many
POS	Possessive ending	possessive marker	' 's
PRP	Personal pronoun	personal pronoun	hers herself him himself hisself
PRP$	Possessive pronoun	possessive	her his mine my our ours
RB	Adverb	adverb	occasionally unabatingly maddeningly
RBR	Adverb, comparative	comparative adverb	further gloomier grander
RBS	Adverb, superlative	superlative adverb	best biggest bluntest earliest
RP	Particle	particle	aboard about across along apart
SYM	Symbol	symbol	% & * + < = > @
TO	to	the word "to"	to
UH	Interjection	interjection	Goodbye Goody Gosh Wow
VB	Verb, base form	verb	ask assemble assess
VBD	Verb, past tense	past-tense verb	dipped pleaded swiped
VBG	Verb, gerund or present participle	present participle	telegraphing stirring focusing
VBN	Verb, past participle	past participle	multihulled dilapidated aerosolized
VBP	Verb, non-3rd person singular present	present tense, non-3rd person	predominate wrap resort sue
VBZ	Verb, 3rd person singular present	present tense, 3rd person	bases reconstructs marks
WDT	Wh-determiner	wh-determiner	which, what, whatever, that
WP	Wh-pronoun	wh-pronoun	who, whom, what, whoever
WP$	Possessive wh-pronoun	possessive wh-pronoun	whose
WRB	Wh-adverb	wh-adverb	how, where, when, why
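
Many of these tags share a prefix (NN, NNS, NNP and NNPS are all nouns, for instance), so a common trick is to collapse them into coarse word classes. The sketch below is one possible way to do that; `coarse_category` is a hypothetical helper, and its four groups are an illustrative simplification rather than anything defined by NLTK.

```python
def coarse_category(tag):
    """Collapse a Penn Treebank style tag into a coarse word class by prefix."""
    prefixes = [
        ("NN", "noun"),       # NN, NNS, NNP, NNPS
        ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
        ("JJ", "adjective"),  # JJ, JJR, JJS
        ("RB", "adverb"),     # RB, RBR, RBS
    ]
    for prefix, name in prefixes:
        if tag.startswith(prefix):
            return name
    return "other"


print(coarse_category("NNPS"))  # noun
print(coarse_category("VBZ"))   # verb
print(coarse_category("DT"))    # other
```

Note that WRB falls into "other" here, because the check looks at the start of the tag and WRB begins with W.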

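import nltk
from nltk import word_tokenize

# Tokenize the paragraph into a list of words
paragraph = 'When Jobs arrived back at Apple, it had a conventional structure for a company of its size and scope. It was divided into business units,'
words = word_tokenize(paragraph)
# Part-of-speech tagging: returns a list of (word, tag) tuples
pos_tag = nltk.pos_tag(words)
print(pos_tag)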

To get the tag of a single word, you still have to pass a list:

t = nltk.pos_tag(['news'])
print(t)
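
The list is needed because the tagger works on a sequence of tokens. Depending on the NLTK version, a bare string is either rejected or treated as a sequence of single characters. A small wrapper can hide that detail. This is a sketch under assumptions: `tag_single_word` and its `tagger` parameter are hypothetical, and the stand-in tagger exists only to show the behaviour without downloading a model.

```python
def tag_single_word(word, tagger=None):
    """Return the tag of one word by wrapping it in a one-element list.

    tagger defaults to nltk.pos_tag; any callable that maps a token list to
    (word, tag) pairs can be passed instead (hypothetical parameter).
    """
    if tagger is None:
        import nltk  # imported lazily so the helper also works with a stand-in
        tagger = nltk.pos_tag
    return tagger([word])[0][1]


def fake_tagger(tokens):
    """Stand-in that labels every token NN; it is not a real tagger."""
    return [(token, "NN") for token in tokens]


print(tag_single_word("news", tagger=fake_tagger))  # NN
print(len(fake_tagger("news")))  # 4, one pair per character of the string
```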

NLTK corpora

The corpora are stored under D:\nltk_data\corpora:

Reference: NLTK text corpora

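Gutenberg corpus: gutenberg. Contains a small selection of texts from the Project Gutenberg electronic text archive, which currently hosts about 36,000 free e-books.
Web and chat corpora: webtext, nps_chat. These represent informal language, including a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews.
Brown corpus: brown. The Brown Corpus was the first million-word electronic corpus of English, created at Brown University in 1961. It contains text from 500 sources, categorized by genre, such as news and editorial, which makes it a resource for studying systematic differences between genres.
Reuters corpus: reuters. Contains 10,788 news documents totaling 1.3 million words. The documents are classified into 90 topics and split into two sets, 'training' and 'test'; the document with id 'test/14826', for example, belongs to the test set. The split exists so that algorithms that automatically detect a document's topic can be trained and evaluated.
Inaugural address corpus: inaugural. A collection of 55 texts, each one a presidential address. Its distinguishing feature is the time dimension.
Annotated text corpora and corpora in other languages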
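
The Reuters file ids carry their group as a prefix, so the training/test split can be recovered with plain string handling. The sketch below assumes ids of the form 'group/number'; `split_by_group` is a hypothetical helper, and 'training/9865' is an illustrative id. With the corpus installed, the real list would come from `nltk.corpus.reuters.fileids()`.

```python
def split_by_group(fileids):
    """Partition Reuters-style file ids into training and test lists."""
    groups = {"training": [], "test": []}
    for fileid in fileids:
        group, _, _doc = fileid.partition("/")
        if group in groups:  # ignore ids with an unknown prefix
            groups[group].append(fileid)
    return groups


# 'test/14826' is the id mentioned in this post; the other id is illustrative
ids = ["test/14826", "training/9865"]
print(split_by_group(ids)["test"])  # ['test/14826']
```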

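Please credit the source when reposting. You are welcome to verify the sources cited in this article and to point out anything that is wrong or unclear, either in the comment section below or by email to changzeyan@foxmail.com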