Text Classification in Practice ---- Data Processing ---- Handling Words in the Vocab That the Embeddings Do Not Cover

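The previous post covered some data-processing methods. In this one we analyze the data, both to understand it better and to lay the groundwork for the later steps. Hopefully it also builds up some experience that carries over when a similar task comes up.
Enough talk, let's get started.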

The coverage gap between embeddings and vocab

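We use pre-trained embeddings to map the words in the vocab of the training and test sets to vectors. The catch is that the pre-trained embeddings do not cover every word in the vocab, so an uncovered word can only be represented by a random vector or an unknown vector, which hurts the final task performance. There are a few main causes. The first is rare words that never appear in the pre-trained embeddings, although this is relatively uncommon. The second is capitalization and contractions: the pre-trained embeddings are case-sensitive, so this contributes as well. The third is spelling mistakes in the dataset, and this is the main cause: the questions were typed in by the people asking them, so typos are inevitable, and a misspelled word will certainly not appear in the pre-trained embeddings. It is also the hardest case to handle, because this competition forbids external datasets, so we cannot use an external spell-checking library to correct the mistakes.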
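
To make the cost of an uncovered word concrete, here is a minimal sketch of how an embedding matrix is usually filled. The function name build_embedding_matrix, the toy word_index and the choice of a small random normal vector as the fallback are all assumptions for illustration, not the code used in this series.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, dim, seed=0):
    """Return (matrix, oov_words); row i holds the vector of the word with index i."""
    rng = np.random.default_rng(seed)
    # Row 0 is left as zeros (commonly reserved for padding).
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    oov_words = []
    for word, idx in word_index.items():
        vector = embeddings_index.get(word)
        if vector is None:
            # Uncovered word: fall back to a small random vector.
            vector = rng.normal(scale=0.1, size=dim)
            oov_words.append(word)
        matrix[idx] = vector
    return matrix, oov_words

# Hypothetical toy data: "qoura" is a typo, so it has no pre-trained vector.
toy_embeddings = {
    "how": np.ones(4, dtype="float32"),
    "india": np.full(4, 2.0, dtype="float32"),
}
toy_word_index = {"how": 1, "india": 2, "qoura": 3}
matrix, oov_words = build_embedding_matrix(toy_word_index, toy_embeddings, dim=4)
print(oov_words)  # ['qoura']
```

Every word that lands in oov_words starts from noise instead of pre-trained knowledge, which is why the rest of this post tries to shrink that list.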

So let's see how to analyze the data.

1. Build the vocab

The vocab is a dictionary that maps each word to the number of times it occurs.

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers progress_apply on pandas objects

train = pd.read_csv("../input/train.csv")  # Train shape = (1306122, 3)
test = pd.read_csv("../input/test.csv")  # Test shape = (56370, 2)
df = pd.concat([train, test])  # 1362492 rows in total

def build_vocab(sentences, verbose=True):
    """
    :param sentences: list of list of words (the train and test questions)
    :return: dictionary of words and their count
    Walks over every text and counts each word.
    """
    vocab = {}
    for sentence in tqdm(sentences, disable=(not verbose)):
        for word in sentence:
            vocab[word] = vocab.get(word, 0) + 1
    return vocab

sentences = df['question_text'].progress_apply(lambda x: x.split()).values
vocab = build_vocab(sentences)  # vocab_size=508823

Let's see what the finished vocab dictionary looks like by printing its first five entries (in insertion order, not sorted by frequency):

{'How': 261930, 'did': 33489, 'Quebec': 97, 'nationalists': 91, 'see': 9003}
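
A plain dict keeps insertion order, so its first entries are simply the first words seen. If the words with the highest counts are wanted instead, collections.Counter is one option. The sketch below uses a made-up toy_sentences list rather than the real data.

```python
from collections import Counter

# Hypothetical tokenized questions standing in for the real `sentences` array.
toy_sentences = [
    ["How", "did", "Quebec", "vote"],
    ["How", "did", "it", "go"],
    ["How", "so"],
]

# Counter does the same counting as build_vocab and can rank the result.
counts = Counter(word for sentence in toy_sentences for word in sentence)
top_two = counts.most_common(2)
print(top_two)  # [('How', 3), ('did', 2)]
```

A Counter is a dict subclass, so it could be passed to the coverage check later in this post without changes.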

2. Load the pre-trained embeddings

To get better results, we load and use four sets of pre-trained embeddings.

import numpy as np
from gensim.models import KeyedVectors

google = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
glove = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
paragram = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
wiki_news = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'

def load_embed(file):
    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    if file == wiki_news:
        # len(o) > 100 skips the short header line of the .vec file
        with open(file, encoding='utf-8') as f:
            embeddings_index = dict(get_coefs(*o.split(" ")) for o in f if len(o) > 100)
    elif file == google:
        model = KeyedVectors.load_word2vec_format(file, binary=True)
        # index_to_key needs gensim >= 4.0; older releases call this list index2word
        embeddings_index = dict(zip(model.index_to_key, model.vectors))
    else:
        with open(file, encoding='latin') as f:
            embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)

    return embeddings_index

# Load the four embeddings under the names used by the calls in the next sections
embed_google = load_embed(google)
embed_glove = load_embed(glove)
embed_paragram = load_embed(paragram)
embed_fasttext = load_embed(wiki_news)
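
The text formats (.txt and .vec) store one word per line followed by its vector components. As a small illustration of that layout, the sketch below parses an in-memory string instead of a multi-gigabyte file; parse_text_embeddings and the toy data are hypothetical, and the header check is simplified compared with the length filter in the loader above.

```python
import io
import numpy as np

def parse_text_embeddings(handle):
    """Parse 'word v1 v2 ...' lines into a {word: vector} dict."""
    index = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) <= 2:
            # A .vec file starts with a "count dim" header line; skip it.
            continue
        index[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return index

# Toy file with a header and two 3-dimensional vectors.
toy_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
toy_index = parse_text_embeddings(toy_file)
print(sorted(toy_index), toy_index["hello"].shape)
```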

3. Check how well the pre-trained embeddings cover the vocab

import operator

def check_coverage(vocab, embeddings_index):
    known_words = {}  # words present in both
    unknown_words = {}  # words the embeddings do not cover
    nb_known_words = 0  # matching occurrence counts
    nb_unknown_words = 0
    for word in tqdm(vocab):
        try:
            known_words[word] = embeddings_index[word]
            nb_known_words += vocab[word]
        except KeyError:
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]

    # Share of distinct words that are covered
    print('Found embeddings for {:.2%} of vocab'.format(len(known_words) / len(vocab)))
    # Share of word occurrences that are covered; it differs from the line above because words repeat in the text
    print('Found embeddings for  {:.2%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]
    print("unknown words : ", unknown_words[:30])
    return unknown_words

oov_google = check_coverage(vocab, embed_google)
oov_glove = check_coverage(vocab, embed_glove)
oov_paragram = check_coverage(vocab, embed_paragram)
oov_fasttext = check_coverage(vocab, embed_fasttext)
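
The two percentages can be far apart, because a few frequent words account for most of the text. The sketch below recomputes both numbers on made-up toy data; the helper name coverage and the counts are assumptions for illustration.

```python
def coverage(vocab, embeddings_index):
    """Return (share of distinct words covered, share of word occurrences covered)."""
    known = [word for word in vocab if word in embeddings_index]
    vocab_cov = len(known) / len(vocab)
    text_cov = sum(vocab[word] for word in known) / sum(vocab.values())
    return vocab_cov, text_cov

# Hypothetical counts: two frequent covered words, two rare uncovered ones.
toy_vocab = {"the": 90, "cat": 8, "qoura": 1, "brexit": 1}
toy_embeddings = {"the": None, "cat": None}  # only membership matters here

vocab_cov, text_cov = coverage(toy_vocab, toy_embeddings)
print(f"{vocab_cov:.2%} of vocab, {text_cov:.2%} of all text")
# 50.00% of vocab, 98.00% of all text
```

Half of the distinct words are missing, yet only 2% of the tokens are affected, which is the same pattern as in the real results.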

Now let's look at the results, printing the 30 most frequent unknown_words to see what is going on.

Google : 
Found embeddings for 24.05% of vocab
Found embeddings for  78.75% of all text
unknown words :  [('to', 420476), ('a', 419837), ('of', 345145), ('and', 262815), ('India?', 17082), ('it?', 13436), ('do?', 9112), ('life?', 8074), ('you?', 6553), ('me?', 6485), ('them?', 6421), ('time?', 5994), ('world?', 5632), ('people?', 5191), ('why?', 5144), ('Quora?', 4872), ('10', 4783), ('like?', 4677), ('for?', 4631), ('work?', 4392), ('2017?', 4227), ('mean?', 4137), ('2018?', 3746), ('country?', 3578), ('now?', 3496), ('this?', 3464), ('years?', 3387), ('2017', 3300), ('not?', 3246), ('year?', 2913)]

Glove : 
Found embeddings for 32.77% of vocab
Found embeddings for  88.15% of all text
unknown words :  [('India?', 17082), ('it?', 13436), ("What's", 12985), ('do?', 9112), ('life?', 8074), ('you?', 6553), ('me?', 6485), ('them?', 6421), ('time?', 5994), ('world?', 5632), ('people?', 5191), ('why?', 5144), ('Quora?', 4872), ('like?', 4677), ('for?', 4631), ('work?', 4392), ('2017?', 4227), ('mean?', 4137), ('2018?', 3746), ('country?', 3578), ('now?', 3496), ('this?', 3464), ('years?', 3387), ('not?', 3246), ('year?', 2913), ('day?', 2834), ('engineering?', 2743), ('person?', 2728), ('school?', 2688), ('so,', 2679)]

Paragram : 
Found embeddings for 19.37% of vocab
Found embeddings for  72.21% of all text
unknown words :  [('What', 436013), ('I', 319441), ('How', 273144), ('Why', 148582), ('Is', 113627), ('Can', 54992), ('Which', 49357), ('Do', 41756), ('If', 35896), ('Are', 30442), ('Does', 24142), ('Who', 22884), ('Where', 20008), ('Should', 17269), ('India?', 17082), ('Will', 15283), ('When', 15084), ('India', 14270), ('Indian', 13441), ('it?', 13436), ("I'm", 13344), ("What's", 12985), ('Trump', 10569), ('Quora', 10447), ('In', 10441), ('Would', 10307), ('US', 9832), ('do?', 9112), ('My', 8463), ('The', 8215)]

FastText : 
Found embeddings for 29.77% of vocab
Found embeddings for  87.66% of all text
unknown words :  [('India?', 17082), ("don't", 15642), ('it?', 13436), ("I'm", 13344), ("What's", 12985), ('do?', 9112), ('life?', 8074), ("can't", 7375), ('you?', 6553), ('me?', 6485), ('them?', 6421), ('time?', 5994), ("doesn't", 5970), ('world?', 5632), ('people?', 5191), ('why?', 5144), ("it's", 5019), ('Quora?', 4872), ('like?', 4677), ('for?', 4631), ('work?', 4392), ('2017?', 4227), ('mean?', 4137), ('2018?', 3746), ('country?', 3578), ('now?', 3496), ('this?', 3464), ('years?', 3387), ("didn't", 3329), ('not?', 3246)]

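A few problems stand out. 1. Capitalization, especially for the paragram embeddings, which seem unable to recognize any word that contains a capital letter. 2. Contractions such as ("don't", 15642) and ("it's", 5019). This can be solved by matching: we hand-write a dictionary and expand these words. 3. Punctuation such as the question mark. I am not sure whether the embeddings lack the question mark itself or whether a question mark glued to a word is what cannot be recognized. Either way, my plan was to strip the punctuation out of the sentences, since it contributes little to their meaning (the code in step 7 ends up separating punctuation from the words with spaces rather than deleting it). Let's deal with these problems one by one.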
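
One way to tell the two punctuation hypotheses apart is to split a failing token into its word and punctuation parts and look each part up separately. The sketch below does this on a made-up set of covered tokens; split_punct and toy_embeddings are hypothetical names, and the regular expression is just one reasonable choice.

```python
import re

# Hypothetical set of tokens that have a vector.
toy_embeddings = {"India", "it", "?"}

def split_punct(token):
    """Split a token into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", token)

for token in ["India?", "it?"]:
    pieces = split_punct(token)
    whole_known = token in toy_embeddings
    pieces_known = all(piece in toy_embeddings for piece in pieces)
    print(token, whole_known, pieces, pieces_known)
# India? False ['India', '?'] True
# it? False ['it', '?'] True
```

If the pieces are found while the whole token is not, the glued punctuation is the cause; if "?" itself is missing, the embedding simply has no vector for it.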

4. Lowercase the vocab

df['lowered_question'] = df['question_text'].apply(lambda x: x.lower())
# build_vocab expects lists of words, so split each question first
vocab_low = build_vocab(df['lowered_question'].apply(lambda x: x.split()).values)

oov_google = check_coverage(vocab_low, embed_google)
oov_glove = check_coverage(vocab_low, embed_glove)
oov_paragram = check_coverage(vocab_low, embed_paragram)
oov_fasttext = check_coverage(vocab_low, embed_fasttext)

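Now let's see how that worked (in each group, the first two lines are the results before lowercasing and the last two are the results after):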

Google : 
Found embeddings for 24.05% of vocab
Found embeddings for  78.75% of all text
Found embeddings for 15.24% of vocab
Found embeddings for  77.69% of all text

Glove : 
Found embeddings for 32.77% of vocab
Found embeddings for  88.15% of all text
Found embeddings for 27.10% of vocab
Found embeddings for  87.88% of all text

Paragram : 
Found embeddings for 19.37% of vocab
Found embeddings for  72.21% of all text
Found embeddings for 31.01% of vocab
Found embeddings for  88.21% of all text

FastText : 
Found embeddings for 29.77% of vocab
Found embeddings for  87.66% of all text
Found embeddings for 21.74% of vocab
Found embeddings for  87.14% of all text

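As the numbers show, only paragram improves; the other three embeddings actually get worse. One idea, then, is to lowercase only for paragram and leave the other three alone. Another is to add to each set of embeddings the lowercase forms it is missing. Let's first see how the second approach does.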
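
Why would lowercasing help one embedding and hurt another? A toy reproduction, under the assumption that one embedding stores mostly cased forms and the other only lowercase forms, shows the same split. All names and data below are hypothetical.

```python
def text_coverage(vocab, embeddings_index):
    """Share of word occurrences whose token has a vector."""
    covered = sum(count for word, count in vocab.items() if word in embeddings_index)
    return covered / sum(vocab.values())

def lower_vocab(vocab):
    """Lowercase every key, merging the counts of words that differ only by case."""
    lowered = {}
    for word, count in vocab.items():
        lowered[word.lower()] = lowered.get(word.lower(), 0) + count
    return lowered

# Hypothetical toy data, not taken from the real embedding files.
toy_vocab = {"What": 5, "India": 3, "Trump": 2}
cased_embeddings = {"What", "what", "India", "Trump"}  # proper nouns only in cased form
lowercase_embeddings = {"what", "india", "trump"}      # lowercase forms only

for name, emb in [("cased", cased_embeddings), ("lowercase-only", lowercase_embeddings)]:
    before = text_coverage(toy_vocab, emb)
    after = text_coverage(lower_vocab(toy_vocab), emb)
    print(f"{name}: {before:.0%} -> {after:.0%}")
# cased: 100% -> 50%
# lowercase-only: 0% -> 100%
```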

5. Add lowercase forms to the embeddings

def add_lower(embedding, vocabulary):
    count = 0
    for word in tqdm(vocabulary):
        # Reuse the cased vector when the lowercase form has none of its own
        if word in embedding and word.lower() not in embedding:
            embedding[word.lower()] = embedding[word]
            count += 1

    print(f"Added {count} words to embedding")

add_lower(embed_google, vocab)
add_lower(embed_glove, vocab)
add_lower(embed_paragram, vocab)
add_lower(embed_fasttext, vocab)

Let's see how many words were added:

Added 30276 words to embedding  # Google
Added 15199 words to embedding  # glove
Added 0 words to embedding  # paragram
Added 27908 words to embedding  # fasttext
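
The effect of this step is easy to check on a tiny dictionary. The sketch below is a simplified variant of add_lower that returns the count instead of printing it; the toy embedding and vocabulary are made up.

```python
def add_lower(embedding, vocabulary):
    """Copy the vector of a cased word to its lowercase form when that form is missing."""
    added = 0
    for word in vocabulary:
        lowered = word.lower()
        if word in embedding and lowered not in embedding:
            embedding[lowered] = embedding[word]
            added += 1
    return added

# Hypothetical toy data: "India" has a vector, "india" does not, "Qoura" has neither.
toy_embedding = {"India": [0.1, 0.2], "what": [0.3, 0.4]}
toy_vocab = {"India": 3, "what": 5, "Qoura": 1}

added = add_lower(toy_embedding, toy_vocab)
print(added, sorted(toy_embedding))  # 1 ['India', 'india', 'what']
```

A word without any vector, such as the typo "Qoura", gains nothing from this step, which fits the 0 words added for paragram if that embedding contains no cased forms to copy from.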

And now the effect (results of the three steps side by side, in order):

Google:
Found embeddings for 24.05% of vocab  15.24% of vocab  21.79% of vocab
Found embeddings for  78.75% of all text  77.69% of all text  87.22% of all text

Glove : 
Found embeddings for 32.77% of vocab  27.10% of vocab  30.39% of vocab
Found embeddings for  88.15% of all text  87.88% of all text  88.19% of all text

Paragram : 
Found embeddings for 19.37% of vocab  31.01% of vocab  31.01% of vocab
Found embeddings for  72.21% of all text  88.21% of all text  88.21% of all text

FastText : 
Found embeddings for 29.77% of vocab  21.74% of vocab  27.77% of vocab
Found embeddings for  87.66% of all text  87.14% of all text  87.73% of all text

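Looking at the results: paragram is unchanged, because no new words were added to it. For the other three embeddings, compared with the best result of the previous two steps, coverage of the vocab drops slightly while coverage of all text improves, most of all for Google.

Next we expand the contractions into their full forms.

6. Expand contractions

The mapping dictionary: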

contraction_mapping = {
    "ain't": "is not", "aren't": "are not", "can't": "cannot", "'cause": "because", "could've": "could have",
    "couldn't": "could not", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not",
    "hasn't": "has not", "haven't": "have not", "he'd": "he would", "he'll": "he will", "he's": "he is",
    "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is", "I'd": "I would",
    "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have", "I'm": "I am", "I've": "I have",
    "i'd": "i would", "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have", "i'm": "i am",
    "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will",
    "it'll've": "it will have", "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not",
    "might've": "might have", "mightn't": "might not", "mightn't've": "might not have", "must've": "must have",
    "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
    "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not",
    "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have",
    "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have",
    "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have", "so's": "so as",
    "this's": "this is", "that'd": "that would", "that'd've": "that would have", "that's": "that is",
    "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is",
    "they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have",
    "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would",
    "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
    "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have",
    "what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is",
    "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have",
    "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
    "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not",
    "won't've": "will not have", "would've": "would have", "wouldn't": "would not",
    "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
    "y'all'd've": "you all would have", "y'all're": "you all are", "y'all've": "you all have",
    "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
    "you're": "you are", "you've": "you have",
}

Let's see which of these contractions already appear in the embeddings.

def known_contractions(embed):
    known = []
    for contract in contraction_mapping:
        if contract in embed:
            known.append(contract)
    return known

print("- Known Contractions -")
print("   Google :")
print(known_contractions(embed_google))
print("   Glove :")
print(known_contractions(embed_glove))
print("   Paragram :")
print(known_contractions(embed_paragram))
print("   FastText :")
print(known_contractions(embed_fasttext))

The result:

- Known Contractions -
   Google :
["ain't", "aren't", "can't", "could've", "couldn't", "didn't", "doesn't", "don't", "hadn't", "hasn't", "haven't", "he'd", "he'll", "he's", "how'd", "how's", "I'd", "I'd've", "I'll", "I'm", "I've", "i'd", "i'll", "i'm", "i've", "isn't", "it'd", "it'll", "it's", "let's", "ma'am", "must've", "o'clock", "oughtn't", "she'd", "she'll", "she's", "should've", "shouldn't", "that's", "there's", "here's", "they'd", "they'll", "they're", "they've", "wasn't", "we'd", "we'll", "we're", "we've", "weren't", "what're", "what's", "what've", "where'd", "where's", "who'll", "who's", "who've", "won't", "would've", "wouldn't", "wouldn't've", "y'all", "you'd", "you'll", "you're", "you've"]
   Glove :
["can't", "'cause", "didn't", "doesn't", "don't", "I'd", "I'll", "I'm", "I've", "i'd", "i'll", "i'm", "i've", "it's", "ma'am", "o'clock", "that's", "you'll", "you're"]
   Paragram :
["can't", "'cause", "didn't", "doesn't", "don't", "i'd", "i'll", "i'm", "i've", "it's", "ma'am", "o'clock", "that's", "you'll", "you're"]
   FastText :
[]

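One more refinement is needed here. The questions in the dataset may write contractions with different apostrophe characters, while the mapping dictionary above uses "'" throughout. So we first replace every apostrophe variant in the data with the one we use.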

def clean_contractions(text, mapping):
    # Unify the apostrophe variants before looking tokens up in the mapping
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text

df['treated_question'] = df['lowered_question'].apply(lambda x: clean_contractions(x, contraction_mapping))
# build_vocab expects lists of words, so split each question first
vocab = build_vocab(df['treated_question'].apply(lambda x: x.split()).values)
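
A short usage sketch of the same idea on a made-up sentence. It uses a two-entry toy_mapping instead of the full dictionary, and it also shows one limitation of whole-token matching: a contraction with punctuation attached is left alone.

```python
def clean_contractions(text, mapping):
    """Unify apostrophe variants, then expand tokens found in the mapping."""
    for s in ["’", "‘", "´", "`"]:
        text = text.replace(s, "'")
    return " ".join(mapping.get(t, t) for t in text.split(" "))

# Hypothetical two-entry mapping standing in for contraction_mapping.
toy_mapping = {"don't": "do not", "what's": "what is"}

expanded = clean_contractions("what’s wrong if i don`t vote", toy_mapping)
print(expanded)  # what is wrong if i do not vote

# The token "don't?" is not a key of the mapping, so it stays as it is.
untouched = clean_contractions("why don't?", toy_mapping)
print(untouched)  # why don't?
```

This limitation would be consistent with the small gains in the next table: many contractions in questions sit right before a question mark.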

Now let's check the coverage again:

Google:
Found embeddings for 24.05% of vocab  15.24% of vocab  21.79% of vocab  21.88% of vocab
Found embeddings for  78.75% of all text  77.69% of all text  87.22% of all text  87.39% of all text

Glove : 
Found embeddings for 32.77% of vocab  27.10% of vocab  30.39% of vocab  30.53% of vocab
Found embeddings for  88.15% of all text  87.88% of all text  88.19% of all text  88.56% of all text

Paragram : 
Found embeddings for 19.37% of vocab  31.01% of vocab  31.01% of vocab  31.16% of vocab
Found embeddings for  72.21% of all text  88.21% of all text  88.21% of all text  88.58% of all text

FastText : 
Found embeddings for 29.77% of vocab  21.74% of vocab  27.77% of vocab  27.91% of vocab
Found embeddings for  87.66% of all text  87.14% of all text  87.73% of all text  88.44% of all text

All four embeddings now reach their best coverage of all text so far.

7. Handle punctuation

In this step we deal with punctuation. First, let's see which punctuation marks the pre-trained embeddings do not recognize.

punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\\×™√²—–&'

def unknown_punct(embed, punct):
    unknown = ''
    for p in punct:
        if p not in embed:
            unknown += p
            unknown += ' '
    return unknown

print(unknown_punct(embed_google, punct))
print(unknown_punct(embed_glove, punct))
print(unknown_punct(embed_paragram, punct))
print(unknown_punct(embed_fasttext, punct))

The result looks like this:

Google :
/ - ' ? ! . , ' ( ) - / : ; < [ \ ] { | } " " “ ” ’ − ∅ ‘ ₹ ´ \ — – 
Glove :
“ ” ’ ∞ θ ÷ α • à − β ∅ ³ π ‘ ₹ ´ ° £ € × ™ √ ² — – 
Paragram :
“ ” ’ ∞ θ ÷ α • à − β ∅ ³ π ‘ ₹ ´ ° £ € × ™ √ ² — – 
FastText :
_ `

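Next we replace some of the less common punctuation marks with more common ones, put spaces around every punctuation mark, and look at the effect again.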

punct_mapping = {
    "‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2",
    "—": "-", "–": "-", "’": "'", "_": "-", "`": "'", '“': '"', '”': '"', "£": "e",
    '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-',
    'β': 'beta', '∅': '', '³': '3', 'π': 'pi',
}

def clean_special_chars(text, punct, mapping):
    # Replace uncommon symbols with common equivalents
    for p in mapping:
        text = text.replace(p, mapping[p])
    # Put spaces around every punctuation mark so it becomes its own token
    for p in punct:
        text = text.replace(p, f' {p} ')
    specials = {'\u200b': ' ', '…': ' ... ', '\ufeff': '', 'करना': '', 'है': ''}  # Other special characters that I have to deal with in last
    for s in specials:
        text = text.replace(s, specials[s])
    return text

df['treated_question'] = df['treated_question'].apply(lambda x: clean_special_chars(x, punct, punct_mapping))
# build_vocab expects lists of words, so split each question first
vocab = build_vocab(df['treated_question'].apply(lambda x: x.split()).values)

oov_google = check_coverage(vocab, embed_google)
oov_glove = check_coverage(vocab, embed_glove)
oov_paragram = check_coverage(vocab, embed_paragram)
oov_fasttext = check_coverage(vocab, embed_fasttext)
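
To see what this cleaning does to a single question, here is a small sketch with a reduced punctuation string and mapping. The toy inputs are hypothetical, and the specials handling is left out for brevity.

```python
def clean_special_chars(text, punct, mapping):
    """Map uncommon symbols to common ones, then isolate each punctuation mark."""
    for p in mapping:
        text = text.replace(p, mapping[p])
    for p in punct:
        text = text.replace(p, f" {p} ")
    return text

# Reduced stand-ins for punct and punct_mapping.
toy_punct = "?,'"
toy_mapping = {"’": "'", "²": "2"}

cleaned = clean_special_chars("is x² > 0, in india’s schools?", toy_punct, toy_mapping)
tokens = cleaned.split()
print(tokens)
# ['is', 'x2', '>', '0', ',', 'in', 'india', "'", 's', 'schools', '?']
```

Tokens such as "schools?" become "schools" and "?", which both have a much better chance of being in the embeddings. The extra spaces introduced by the replacement are harmless because str.split() with no argument collapses runs of whitespace.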

Here are the results for each embedding:

Google:
Found embeddings for 53.59% of vocab
Found embeddings for  87.36% of all text
unknown words :  [('?', 1440789), (',', 244864), ('.', 139697), ('"', 84574), ("'", 81400), ('-', 71911), ('(', 58958), (')', 58944), ('/', 44071), ('2017', 9254), (':', 9048), ('10', 7795), ('2018', 7733), ('12', 4029), ('\\', 3695), ('{', 3320), ('}', 3298), ('100', 3260), ('20', 3169), (']', 2983), ('[', 2976), ('15', 2791), ('12th', 2679), ('11', 2650), ('30', 2387), ('!', 2346), ('50', 2321), ('18', 2268), ('000', 2177), ('...', 2011)]

Glove:
Found embeddings for 69.10% of vocab
Found embeddings for  99.58% of all text
unknown words :  [('quorans', 885), ('brexit', 542), ('cryptocurrencies', 525), ('redmi', 398), ('coin ', 150), ('oneplus', 144), ('uceed', 126), ('demonetisation', 118), ('bhakts', 118), ('upwork', 117), ('pokémon', 117), ('machedo', 112), ('gdpr', 110), ('adityanath', 108), ('bnbr', 105), ('boruto', 105), ('alshamsi', 100), ('dceu', 94), ('iiest', 91), ('litecoin', 90), ('unacademy', 89), ('sjws', 89), ('zerodha', 85), ('qoura', 85), ('tensorflow', 82), ('fiancé', 76), ('lnmiit', 73), ('kavalireddi', 71), ('doklam', 70), ('muoet', 68)]

Paragram:
Found embeddings for 73.58% of vocab
Found embeddings for  99.63% of all text
unknown words :  [('quorans', 885), ('brexit', 542), ('cryptocurrencies', 525), ('redmi', 398), ('coin ', 150), ('oneplus', 144), ('uceed', 126), ('demonetisation', 118), ('bhakts', 118), ('upwork', 117), ('pokémon', 117), ('machedo', 112), ('gdpr', 110), ('adityanath', 108), ('bnbr', 105), ('boruto', 105), ('alshamsi', 100), ('dceu', 94), ('iiest', 91), ('litecoin', 90), ('unacademy', 89), ('sjws', 89), ('zerodha', 85), ('qoura', 85), ('tensorflow', 82), ('fiancé', 76), ('lnmiit', 73), ('kavalireddi', 71), ('doklam', 70), ('muoet', 68)]

FastText:
Found embeddings for 60.75% of vocab
Found embeddings for  99.45% of all text
unknown words :  [('quorans', 885), ('bitsat', 583), ('kvpy', 369), ('comedk', 369), ('quoran', 325), ('wbjee', 246), ('articleship', 218), ('viteee', 193), ('fortnite', 166), ('upes', 164), ('marksheet', 151), ('afcat', 131), ('uceed', 126), ('dropshipping', 123), ('bhakts', 118), ('iitjee', 114), ('machedo', 112), ('upsee', 111), ('bnbr', 105), ('alshamsi', 100), ('chsl', 100), ('iitian', 99), ('amcat', 97), ('josaa', 96), ('unacademy', 89), ('zerodha', 85), ('qoura', 85), ('nmat', 80), ('icos', 79), ('jiit', 78)]

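Analysis of the results:
The Google embeddings have no vectors for punctuation marks or numbers. For the other three, what remains uncovered is mostly rare material: recent coinages, brand names and exam acronyms such as 'brexit', 'redmi' and 'bitsat', plus misspellings such as 'qoura'. Those three embeddings now cover more than 99% of the text, while Google only reaches 87%. No wonder that in the few kernels I looked at, almost everyone uses the other three embeddings.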
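
Putting the text-side steps of this post together, a single preprocessing function could look like the sketch below. It is a condensed assumption of how the pieces fit, with toy mappings in place of the full dictionaries, and not the exact pipeline used in later posts. Step 5 is not included because it modifies the embeddings rather than the text.

```python
def preprocess(text, contraction_mapping, punct, punct_mapping):
    """Lowercase, expand contractions, normalize and isolate punctuation, then tokenize."""
    text = text.lower()
    for s in ["’", "‘", "´", "`"]:  # unify apostrophe variants
        text = text.replace(s, "'")
    text = " ".join(contraction_mapping.get(t, t) for t in text.split(" "))
    for p, replacement in punct_mapping.items():
        text = text.replace(p, replacement)
    for p in punct:
        text = text.replace(p, f" {p} ")
    return text.split()

# Hypothetical reduced inputs standing in for the full dictionaries above.
toy_contractions = {"what's": "what is"}
toy_punct = "?,"
toy_punct_mapping = {"–": "-"}

tokens = preprocess("What’s Brexit, really?", toy_contractions, toy_punct, toy_punct_mapping)
print(tokens)  # ['what', 'is', 'brexit', ',', 'really', '?']
```

The order matters: contractions are expanded before punctuation is isolated, because isolating "'" first would break "what's" into pieces that no longer match the mapping.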

Views: 1507, 2026-05-06
