NLP Code Template Collection
Table of Contents
- 1 Basic Word Operations
- 1.1 Downloading Stopwords with NLTK
- 1.2 Loading a Language Model with spaCy
- 1.3 Removing Stopwords from a Sentence
- 1.3.1 Input
- 1.3.2 Desired Output
- 1.3.3 Solution
- 1.3.3.1 Method 1: Removing stopwords in nltk
- 1.3.3.2 Method 2: Removing stopwords in spaCy
- 1.4 Adding Custom Stopwords with spaCy
- 1.4.1 Input
- 1.4.2 Expected Output
- 1.4.3 Solution
- 1.5 Removing Punctuation
- 1.5.1 Input
- 1.5.2 Desired Output
- 1.5.3 Solution
- 1.5.3.1 Method 1: Removing punctuations in spaCy
- 1.5.3.2 Method 2: Removing punctuation in nltk with RegexpTokenizer
- 1.6 Merging Words into Phrases with Bigrams (Very Important)
- 1.6.1 Input
- 1.6.2 Desired Output
- 1.6.3 Solution
- 1.7 Counting Bigrams and Trigrams (Very Important)
- 1.7.1 Input
- 1.7.2 Desired Output
- 1.7.3 Solution
- 2 Tokenization
- 2.1 Tokenizing with NLTK or spaCy
- 2.1.1 Input
- 2.1.2 Desired Output
- 2.1.3 Solution
- 2.2 Tokenizing with transformers (Very Important)
- 2.2.1 Input
- 2.2.2 Desired Output
- 2.2.3 Solution
- 2.3 Tokenizing with Stopwords as Delimiters
- 2.3.1 Input
- 2.3.2 Expected Output
- 2.3.3 Solution
- 2.4 Tokenizing Tweets and Other Web Text
- 2.4.1 Input
- 2.4.2 Desired Output
- 2.4.3 Solution
- 3 Basic Sentence Operations
- 3.1 Splitting a Document into Sentences
- 3.1.1 Input
- 3.1.2 Desired Output
- 3.1.3 Solution
- 3.2 Obtaining the Dependency Parse of a Sentence
- 3.2.1 Input
- 3.2.2 Desired Output
- 3.2.3 Solution
- 3.3 Stemming
- 3.3.1 Input
- 3.3.2 Desired Output
- 3.3.3 Solution
- 3.4 Lemmatization
- 3.4.1 Input
- 3.4.2 Desired Output
- 3.4.3 Solution
- 3.5 Spelling Correction (Important)
- 3.5.1 Input
- 3.5.2 Desired Output
- 3.5.3 Solution
- 4 Information Extraction
- 4.1 Extracting Email Usernames from a Document Containing Email Addresses
- 4.1.1 Input
- 4.1.2 Desired Output
- 4.1.3 Solution
- 4.2 Extracting All Nouns from a Document
- 4.2.1 Input
- 4.2.2 Desired Output
- 4.2.3 Solution
- 4.3 Extracting All Person Mentions from a Document
- 4.3.1 Input
- 4.3.2 Desired Output
- 4.3.3 Solution
- 4.4 Replacing Pronouns with the Corresponding Person Names
- 4.4.1 Input
- 4.4.2 Desired Output
- 4.4.3 Solution
- 5 Text Similarity
- 5.1 Extracting the Most Frequent Words, Excluding Stopwords (Important)
- 5.1.1 Input
- 5.1.2 Desired Output
- 5.1.3 Solution
- 5.2 Similarity Between Two Words
- 5.2.1 Input
- 5.2.2 Desired Output
- 5.2.3 Solution
- 5.3 Similarity Between Two Documents
- 5.3.1 Input
- 5.3.2 Desired Output
- 5.3.3 Solution
- 5.4 Cosine Similarity Between Two Documents
- 5.4.1 Input
- 5.4.2 Desired Output
- 5.4.3 Solution
- 5.5 Soft Cosine Similarity Between Documents
- 5.5.1 Input
- 5.5.2 Desired Output
- 5.5.3 Solution
- 5.6 Obtaining Word Embeddings with Word2Vec
- 5.6.1 Input
- 5.6.2 Desired Output
- 5.6.3 Solution
- 5.7 Visualizing Word2Vec Vocabulary (Important)
- 5.7.1 Solution
- 5.8 Obtaining Document Embeddings with Doc2Vec
- 5.8.1 Input
- 5.8.2 Desired Output
- 5.8.3 Solution
- 5.9 Computing Word Similarity with Word2Vec
- 5.9.1 Desired Output
- 5.9.2 Solution
- 5.10 Computing Word Mover's Distance
- 5.10.1 Input
- 5.10.2 Desired Output
- 5.10.3 Solution
- 6 Topic Model
- 6.1 Extracting Topic Keywords with LSA
- 6.1.1 Input
- 6.1.2 Desired Output
- 6.1.3 Solution
- 6.2 Extracting Topic Keywords with LDA
- 6.2.1 Input
- 6.2.2 Desired Output
- 6.2.3 Solution
- 6.3 Extracting Topic Keywords with NMF
- 6.3.1 Input
- 6.3.2 Desired Output
- 6.3.3 Solution
- 6.4 Computing the TF-IDF Matrix
- 6.4.1 Input
- 6.4.2 Desired Output
- 6.4.3 Solution
- 6.4.3.1 Method 1: Using gensim
- 6.4.3.2 Method 2: Using sklearn's TfidfVectorizer
- 6.5 Detecting the Language of a Text
- 6.5.1 Input
- 6.5.2 Desired Output
- 6.5.3 Solution
- 6.6 Merging Person Names into Single Tokens: spaCy's retokenize() Method
- 6.6.1 Input
- 6.6.2 Desired Output
- 6.6.3 Solution
- 6.7 Extracting Noun Phrases: spaCy's noun_chunks
- 6.7.1 Expected Output
- 6.7.2 Solution
- 6.8 Extracting Verb Phrases: textacy.extract.pos_regex_matches
- 6.8.1 Input
- 6.8.2 Desired Output
- 6.8.3 Solution
- 6.9 Extracting Person Names: spaCy's Matcher
- 6.9.1 Input
- 6.9.2 Desired Output
- 6.9.3 Solution
- 6.10 Extracting Entities (NER): entity.text
- 6.10.1 Input
- 6.10.2 Desired Output
- 6.10.3 Solution
- 6.11 Extracting Organizations: entity.label_=="ORG"
- 6.11.1 Input
- 6.11.2 Expected Output
- 6.11.3 Solution
- 6.12 Replacing Person Names with UNKNOWN
- 6.12.1 Input
- 6.12.2 Desired Output
- 6.12.3 Solution
- 6.13 Visualizing Person Names in a Sentence: spaCy's displacy
- 6.13.1 Input
- 6.13.2 Solution
- 6.14 Dependency Parsing
- 6.14.1 Input
- 6.14.2 Desired Output
- 6.14.3 Solution
- 6.15 Finding a Sentence's ROOT Word via the Dependency Parse
- 6.15.1 Input
- 6.15.2 Desired Output
- 6.15.3 Solution
- 6.16 Visualizing the Dependency Tree: spaCy's displacy
- 6.16.1 Input
- 6.16.2 Solution
- 6.17 Extracting Computer Company Names from Text
- 6.17.1 Input
- 6.17.2 Solution
- 7 Summarize Text
- 7.1 Text Summarization with gensim: summarize from gensim.summarization.summarizer
- 7.1.1 Input
- 7.1.2 Desired Output
- 7.1.3 Solution
- 7.2 Text Summarization with LexRank: from sumy.summarizers.lex_rank import LexRankSummarizer
- 7.2.1 Input
- 7.2.2 Desired Output
- 7.2.3 Solution
- 7.3 Text Summarization with the Luhn Algorithm: from sumy.summarizers.luhn import LuhnSummarizer
- 7.3.1 Input
- 7.3.2 Desired Output
- 7.3.3 Solution
- 7.4 Text Summarization with LSA: from sumy.summarizers.lsa import LsaSummarizer
- 7.4.1 Input
- 7.4.2 Solution
- 8 Text Classification
- 8.1 Text Classification with TextBlob: from textblob.classifiers import NaiveBayesClassifier
- 8.1.1 Desired Output
- 8.1.2 Solution
- 8.2 Training a Text Classifier with Simple Transformers: from simpletransformers.classification import ClassificationModel, ClassificationArgs
- 8.2.1 Input
- 8.2.2 Solution
- 8.3 Text Classification with spaCy
- 8.3.1 Solution
- 8.4 Sentiment Classification with transformers
- 8.4.1 Input text
- 8.4.2 Desired Output
- 8.4.3 Solution
- 8.5 Sentiment Classification with TextBlob
- 8.5.1 Input
- 8.5.2 Desired Output
- 8.5.3 Solution
- 9 Text Generation
- 9.1 Machine Translation with simpletransformers: from simpletransformers.seq2seq import Seq2SeqModel
- 9.1.1 Input
- 9.1.2 Desired Output
- 9.1.3 Solution
- 9.2 Building a Question-Answering System with transformers
- 9.2.1 Input
- 9.2.2 Desired Output
- 9.2.3 Solution
- 9.3 Text Generation with transformers
- 9.3.1 Input
- 9.3.2 Desired Output
- 9.3.3 Solution
- 10 Models
- 10.1 Implementing Self-Attention
- 10.2 Counting Model Parameters
- 10.3 Splitting Optimizer Parameters into Groups
- 10.4 Resuming Training from a Checkpoint
- 10.5 Mixed-Precision Computation
- 10.6 Manually Fetching a Batch
- 10.7 Getting Per-Label Probabilities from Predictions
- 10.8 Computing Accuracy
- References
1 Basic Word Operations
1.1 Downloading Stopwords with NLTK
Difficulty Level : L1
This step downloads the stopword packages rather than using them: they have to be downloaded before they can be used.
# Downloading packages and importing
import nltk
nltk.download('punkt')
nltk.download('stopwords')
#> [nltk_data] Downloading package punkt to /root/nltk_data...
#> [nltk_data] Unzipping tokenizers/punkt.zip.
#> [nltk_data] Downloading package stopwords to /root/nltk_data...
#> [nltk_data] Unzipping corpora/stopwords.zip.
#> True
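Once the downloads finish, the stopword list is available from nltk.corpus. This quick sanity check is an added illustration rather than part of the original snippet:
from nltk.corpus import stopwords
# Inspect the first few downloaded English stopwords
print(stopwords.words('english')[:10])
#> e.g. ['i', 'me', 'my', 'myself', 'we', ...]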
1.2 Loading a Language Model with spaCy
Difficulty Level : L1
# Download the model first (run this in a shell, not in Python):
# python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
nlp
# More models here: https://spacy.io/models
#> <spacy.lang.en.English at 0x7facaf6cd0f0>
1.3 Removing Stopwords from a Sentence
Difficulty Level : L1
1.3.1 Input
text="""the outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """
1.3.2 Desired Output
'outbreak coronavirus disease 2019 ( COVID-19 ) created global health crisis deep impact way perceive world everyday lives . rate contagion patterns transmission threatens sense agency , safety measures place contain spread virus require social distancing refraining inherently human , find solace company . context physical threat , social physical distancing , public alarm , ( ) role different mass media channels lives individual , social societal levels ? Mass media long recognized powerful forces shaping experience world . recognition accompanied growing volume research , closely follows footsteps technological transformations ( e.g. radio , movies , television , internet , mobiles ) zeitgeist ( e.g. cold war , 9/11 , climate change ) attempt map mass media major impacts perceive , individuals citizens . media ( broadcast digital ) able convey sense unity reaching large audiences , messages lost noisy crowd mass self - communication ?'
1.3.3 Solution
1.3.3.1 Method 1: Removing stopwords in nltk
# Method 1
# Removing stopwords in nltk
from nltk.corpus import stopwords
my_stopwords = set(stopwords.words('english'))
new_tokens = []
# Tokenization using word_tokenize()
all_tokens = nltk.word_tokenize(text)
for token in all_tokens:
    if token not in my_stopwords:
        new_tokens.append(token)
" ".join(new_tokens)
1.3.3.2 Method 2: Removing stopwords in spaCy
# Method 2
# Removing stopwords in spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
new_tokens = []
# Using the is_stop attribute of each token to check if it's a stopword
for token in doc:
    if token.is_stop == False:
        new_tokens.append(token.text)
" ".join(new_tokens)
1.4 Adding Custom Stopwords with spaCy
Difficulty Level : L1
Q. Add the custom stopwords "NIL" and "JUNK" in spaCy and remove all stopwords from the text below
1.4.1 Input
text=" Jonas was a JUNK great guy NIL Adam was evil NIL Martha JUNK was more of a fool "
1.4.2 Expected Output
'Jonas great guy Adam evil Martha fool'
1.4.3 Solution
import spacy
nlp = spacy.load("en_core_web_sm")
# List of custom stop words
customize_stop_words = ['NIL', 'JUNK']
# Adding these stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True
doc = nlp(text)
tokens = [token.text for token in doc if not token.is_stop]
" ".join(tokens)
1.5 Removing Punctuation
Difficulty Level : L1
Q. Remove all punctuation from the given text
1.5.1 Input
text="The match has concluded !!! India has won the match . Will we fin the finals too ? !"
1.5.2 Desired Output
'The match has concluded India has won the match Will we fin the finals too'
1.5.3 Solution
1.5.3.1 Method 1: Removing punctuations in spaCy
# Removing punctuations in spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
new_tokens = []
# Check if a token is punctuation through the is_punct attribute
for token in doc:
    if token.is_punct == False:
        new_tokens.append(token.text)
" ".join(new_tokens)
1.5.3.2 Method 2: Removing punctuation in nltk with RegexpTokenizer
# Method 2
# Removing punctuation in nltk with RegexpTokenizer
tokenizer = nltk.RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text)
" ".join(tokens)
1.6 Merging Words into Phrases with Bigrams (Very Important)
Difficulty Level : L3
The usual workflow segments text into individual words; segmenting it into phrases is rare unless a relatively heavy dependency-parsing tool is used. The goal of this exercise is to merge word pairs that frequently occur together into phrases.
The core tool is gensim's Phraser.
1.6.1 Input
documents = ["the mayor of new york was there", "new york mayor was present"]
1.6.2 Desired Output
['the', 'mayor', 'of', 'new york', 'was', 'there']
['new york', 'mayor', 'was', 'present']
1.6.3 Solution
# Import Phraser from gensim
from gensim.models import Phrases
from gensim.models.phrases import Phraser

sentence_stream = [doc.split(" ") for doc in documents]
# Creating bigram phraser
# Note: in gensim >= 4.0 the delimiter should be the str ' ' rather than bytes b' '
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)
1.7 Counting Bigrams and Trigrams (Very Important)
Difficulty Level : L3
1.7.1 Input
text="Machine learning is a neccessary field in today's world. Data science can do wonders . Natural Language Processing is how machines understand text "
1.7.2 Desired Output
Bigrams are [('machine', 'learning'), ('learning', 'is'), ('is', 'a'), ('a', 'neccessary'), ('neccessary', 'field'), ('field', 'in'), ('in', "today's"), ("today's", 'world.'), ('world.', 'data'), ('data', 'science'), ('science', 'can'), ('can', 'do'), ('do', 'wonders'), ('wonders', '.'), ('.', 'natural'), ('natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'how'), ('how', 'machines'), ('machines', 'understand'), ('understand', 'text')]
Trigrams are [('machine', 'learning', 'is'), ('learning', 'is', 'a'), ('is', 'a', 'neccessary'), ('a', 'neccessary', 'field'), ('neccessary', 'field', 'in'), ('field', 'in', "today's"), ('in', "today's", 'world.'), ("today's", 'world.', 'data'), ('world.', 'data', 'science'), ('data', 'science', 'can'), ('science', 'can', 'do'), ('can', 'do', 'wonders'), ('do', 'wonders', '.'), ('wonders', '.', 'natural'), ('.', 'natural', 'language'), ('natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'how'), ('is', 'how', 'machines'), ('how', 'machines', 'understand'), ('machines', 'understand', 'text')]
1.7.3 Solution
# Method 1: nltk's ngrams
from nltk import ngrams
bigram = list(ngrams(text.lower().split(), 2))
trigram = list(ngrams(text.lower().split(), 3))
print(" Bigrams are", bigram)
print(" Trigrams are", trigram)

# Method 2: a simple hand-rolled n-gram function
def ngram(text, n):
    # Split the input text on whitespace into a list of words
    words = text.split()
    # Build the list of n-grams
    ngram_list = []
    for i in range(len(words) - n + 1):
        ngram_list.append(' '.join(words[i:i+n]))
    return ngram_list
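Method 2 only defines the helper; a short illustrative usage on the same text variable (added here, not in the original) would be:
# Call the hand-rolled ngram() helper defined above
print(ngram(text.lower(), 2))  # bigram strings such as 'machine learning'
print(ngram(text.lower(), 3))  # trigram strings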
2 Tokenization
2.1 Tokenizing with NLTK or spaCy
Difficulty Level : L1
2.1.1 Input
text="Last week, the University of Cambridge shared its own research that shows if everyone wears a mask outside home,dreaded ‘second wave’ of the pandemic can be avoided."
2.1.2 Desired Output
Last
week
,
the
University
of
Cambridge
shared
...(truncated)...
2.1.3 Solution
# Method 1: Tokenization with nltk
tokens = nltk.word_tokenize(text)
for token in tokens:
    print(token)

# Method 2: Tokenization with spaCy
lm = spacy.load("en_core_web_sm")
tokens = lm(text)
for token in tokens:
    print(token.text)
2.2 Tokenizing with transformers (Very Important)
Difficulty Level : L1
2.2.1 Input
text="I love spring season. I go hiking with my friends"
2.2.2 Desired Output
[101, 1045, 2293, 3500, 2161, 1012, 1045, 2175, 13039, 2007, 2026, 2814, 102]
[CLS] i love spring season. i go hiking with my friends [SEP]
2.2.3 Solution
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Encoding with the tokenizer
inputs = tokenizer.encode(text)
print(inputs)
# The tokenizer can also be called directly
print(tokenizer(text))
# Decoding
print(tokenizer.decode(inputs))
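To inspect the subword tokens themselves rather than their ids, the tokenizer's tokenize method can be used; the output shown below is illustrative for bert-base-uncased:
# Show the WordPiece tokens produced for the same text
print(tokenizer.tokenize(text))
#> e.g. ['i', 'love', 'spring', 'season', '.', 'i', 'go', 'hiking', 'with', 'my', 'friends']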
2.3 Tokenizing with Stopwords as Delimiters
Difficulty Level : L2
Q. Tokenize the given text with the stop words ("is", "the", "was") as delimiters. Tokenizing this way identifies meaningful phrases, which is sometimes useful for topic modeling.
2.3.1 Input
text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.""
2.3.2 Expected Output
['Walter','feeling anxious','He','diagnosed today','He probably','best person I know']
2.3.3 Solution
text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."stop_words_and_delims = ['was', 'is', 'the', '.', ',', '-', '!', '?']
for r in stop_words_and_delims:text = text.replace(r, 'DELIM')words = [t.strip() for t in text.split('DELIM')]
words_filtered = list(filter(lambda a: a not in [''], words))
print(words_filtered)
2.4 Tokenizing Tweets and Other Web Text
Difficulty Level : L2
2.4.1 Input
text=" Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "
2.4.2 Desired Output
['Having','lots','of','fun','goa','vaction','summervacation','Fancy','dinner','Beachbay','restro']
2.4.3 Solution
import re
# Cleaning the tweets
text = re.sub(r'[^\w]', ' ', text)

# Using nltk's TweetTokenizer
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(text))
3 Basic Sentence Operations
3.1 Splitting a Document into Sentences
Difficulty Level : L1
Q. Print the sentences of the given text document
3.1.1 Input
text="""The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """
3.1.2 Desired Output
The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives.
Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others.
Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be)
...(truncated)...
3.1.3 Solution
# Method 1: using spaCy
import spacy
lm = spacy.load('en_core_web_sm')
doc = lm(text)
for sentence in doc.sents:
    print(sentence)

# Method 2: using NLTK
print(nltk.sent_tokenize(text))
3.2 Obtaining the Dependency Parse of a Sentence
Difficulty Level : L3
3.2.1 Input
text1="Netflix has released a new series"
text2="It was shot in London"
text3="It is called Dark and the main character is Jonas"
text4="Adam is the evil character"
3.2.2 Desired Output
{'id': 0,
 'paragraphs': [{'cats': [],
   'raw': 'Netflix has released a new series',
   'sentences': [{'brackets': [],
     'tokens': [{'dep': 'nsubj', 'head': 2, 'id': 0, 'ner': 'U-ORG', 'orth': 'Netflix', 'tag': 'NNP'},
      {'dep': 'aux', 'head': 1, 'id': 1, 'ner': 'O', 'orth': 'has', 'tag': 'VBZ'},
      {'dep': 'ROOT', 'head': 0, 'id': 2, 'ner': 'O', 'orth': 'released', 'tag': 'VBN'},
      {'dep': 'det', 'head': 2, 'id': 3, 'ner': 'O', 'orth': 'a', 'tag': 'DT'},
      {'dep': 'amod', 'head': 1, 'id': 4, 'ner': 'O', 'orth': 'new', 'tag': 'JJ'},
      {'dep': 'dobj', 'head': -3, 'id': 5, 'ner': 'O', 'orth': 'series', 'tag': 'NN'}]}]},
 ...(truncated)
3.2.3 Solution
# Convert into spaCy documents
import spacy
nlp = spacy.load("en_core_web_sm")
doc1 = nlp(text1)
doc2 = nlp(text2)
doc3 = nlp(text3)
doc4 = nlp(text4)

# Import docs_to_json (in spaCy 3.x this helper lives in spacy.training rather than spacy.gold)
from spacy.gold import docs_to_json

# Converting into JSON format
json_data = docs_to_json([doc1, doc2, doc3, doc4])
print(json_data)
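The JSON above is mainly a training-data serialization; for a direct look at the dependency relations, each spaCy token exposes dep_ and head. This small sketch is an addition, not part of the original solution:
# Print each token with its dependency label and the word it attaches to
for token in doc1:
    print(token.text, token.dep_, token.head.text)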
3.3 Stemming
Difficulty Level : L2
3.3.1 Input
text= "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."
3.3.2 Desired Output
text= 'danc is an art . student should be taught danc as a subject in school . I danc in mani of my school function . some peopl are alway hesit to danc .'
3.3.3 Solution
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_tokens = []
for token in nltk.word_tokenize(text):
    stemmed_tokens.append(stemmer.stem(token))
" ".join(stemmed_tokens)

# Other stemmers are also available:
# 1. Porter
# 2. Snowball (more commonly used)
# 3. Lancaster
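A minimal sketch of the Snowball option mentioned above, assuming the same text variable (NLTK's SnowballStemmer takes a language argument):
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer("english")
# Stem every token and rejoin, mirroring the Porter example
" ".join(snowball.stem(token) for token in nltk.word_tokenize(text))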
3.4 Lemmatization
Difficulty Level : L2
Q. Perform lemmatization on the given text
Hint: Lemmatization Approaches
Although stemming and lemmatization are rigorously distinguished in the literature, in practice most projects only need lemmatization.
3.4.1 Input
text= "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."
3.4.2 Desired Output
text= 'dancing be an art . student should be teach dance as a subject in school . -PRON- dance in many of -PRON- school function . some people be always hesitate to dance .'
3.4.3 Solution
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
lemmatized = [token.lemma_ for token in doc]
print(" ".join(lemmatized))
# Note: the '-PRON-' lemmas in the desired output come from spaCy 2.x;
# spaCy 3.x lemmatizes pronouns to the pronoun itself.
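As an alternative sketch (not in the original article), NLTK's WordNetLemmatizer can also lemmatize, though it needs an explicit part-of-speech hint per word:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# pos="v" treats the word as a verb; the default is noun
print(lemmatizer.lemmatize("danced", pos="v"))      #> dance
print(lemmatizer.lemmatize("hesitating", pos="v"))  #> hesitate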