
Python for NLP: Working with the Gensim Library (Part 1)

By Mehran, 3 Bahman 1402



This is the tenth article in my series on Python for NLP. In my previous article, I explained how the StanfordCoreNLP library can be used to perform various NLP tasks.

In this article, we will explore the Gensim library, another very useful NLP library for Python. Gensim was developed primarily for topic modeling. However, it now supports a variety of other NLP tasks, such as converting words to vectors (word2vec), converting documents to vectors (doc2vec), finding text similarity, and text summarization.

In this article and the next one in the series, we will see how to use the Gensim library to perform these tasks.

Installing Gensim

If you use the pip installer to install your Python libraries, you can use the following command to install the Gensim library:

$ pip install gensim

Alternatively, if you use the Anaconda distribution of Python, you can run the following command to install the Gensim library:

$ conda install -c anaconda gensim

Now let's see how we can perform different NLP tasks using the Gensim library.

Creating Dictionaries

Statistical algorithms work with numbers, whereas natural languages contain data in the form of text. Therefore, a mechanism is needed to convert words to numbers. Similarly, after applying different types of processing to the numbers, we need to convert the numbers back to text.

One way to achieve this is to create a dictionary that assigns a numeric ID to every unique word in the document. The dictionary can then be used to find the numeric equivalent of a word and vice versa.
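The idea of a word-to-ID dictionary can be sketched in plain Python before turning to Gensim. This is only an illustration of the concept, not how Gensim implements it; the function name build_vocab and the sample sentence are made up for the example.

```python
# Plain-Python sketch of a word <-> ID mapping (not Gensim's implementation).
def build_vocab(tokens):
    """Assign consecutive integer IDs to the unique tokens, in sorted order."""
    token2id = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
    id2token = {idx: token for token, idx in token2id.items()}
    return token2id, id2token

tokens = "the cat sat on the mat".split()
token2id, id2token = build_vocab(tokens)

encoded = [token2id[t] for t in tokens]   # words -> numbers
decoded = [id2token[i] for i in encoded]  # numbers -> words

print(token2id)
print(encoded)
print(decoded == tokens)
```

Keeping both directions of the mapping is what makes it possible to go from text to numbers and back again.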

Creating Dictionaries Using In-Memory Objects

Creating a dictionary that maps words to IDs is very easy with Python's Gensim library. Look at the following script:

import gensim
from gensim import corpora
from pprint import pprint

# A corpus is a list of documents; here it holds a single document
text = ["""In computer science, artificial intelligence (AI),
             sometimes called machine intelligence, is intelligence
             demonstrated by machines, in contrast to the natural intelligence
             displayed by humans and animals. Computer science defines
             AI research as the study of intelligent agents: any device that
             perceives its environment and takes actions that maximize its chance
             of successfully achieving its goals."""]

# Split each document into whitespace-separated tokens
tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary = corpora.Dictionary(tokens)

print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")

# Print every token next to its numeric ID
for k, v in gensim_dictionary.token2id.items():
    print(f'{k:{15}} {v:{10}}')

In the script above, we first import the gensim library along with the corpora module from the library. Next, we have some text (the first part of the first paragraph of the Wikipedia article on artificial intelligence) stored in the text variable.

To create a dictionary, we need a list of words from our text (also known as tokens). The following line loops over each document in text (the loop variable is named sentence, but here it holds the whole passage) and splits it into whitespace-separated words.

tokens = [[token for token in sentence.split()] for sentence in text]

We are now ready to create our dictionary. To do so, we can use the Dictionary class from the corpora module and pass it the list of tokens.

Finally, to print the contents of the newly created dictionary, we can use the token2id attribute of the Dictionary class. The output of the script above looks like this:

The dictionary has: 46 tokens
(AI),                    0
AI                       1
Computer                 2
In                       3
achieving                4
actions                  5
agents:                  6
and                      7
animals.                 8
any                      9
artificial              10
as                      11
by                      12
called                  13
chance                  14
computer                15
contrast                16
defines                 17
demonstrated            18
device                  19
displayed               20
environment             21
goals.                  22
humans                  23
in                      24
intelligence            25
intelligence,           26
intelligent             27
is                      28
its                     29
machine                 30
machines,               31
maximize                32
natural                 33
of                      34
perceives               35
research                36
science                 37
science,                38
sometimes               39
study                   40
successfully            41
takes                   42
that                    43
the                     44
to                      45

The output shows each unique word in our text along with the numeric ID assigned to it. The word or token is the key of the dictionary and the ID is the value. You can also look up the ID assigned to an individual word using the following script:

print(gensim_dictionary.token2id["study"])

In the script above, we pass the word "study" as the key to our dictionary. In the output, you should see the corresponding value, i.e. the ID of the word "study", which is 40.

Similarly, you can use the following script to find the key or word for a specific ID.

print(list(gensim_dictionary.token2id.keys())[list(gensim_dictionary.token2id.values()).index(40)])
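The lookup above builds two lists and scans them every time it runs. If many IDs need to be resolved, one alternative is to invert the mapping once. The sketch below uses a small plain dict as a stand-in for gensim_dictionary.token2id; its three entries are sample values chosen for illustration.

```python
# A plain dict stands in for gensim_dictionary.token2id (sample values only).
token2id = {"science": 37, "study": 40, "the": 44}

# Build the reverse mapping once; each later lookup is a single dict access.
id2token = {idx: token for token, idx in token2id.items()}

print(id2token[40])
```

A Gensim Dictionary object can also be indexed directly by ID, as in gensim_dictionary[40], which avoids building the reverse mapping by hand.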

To print the tokens and their corresponding IDs, we used a for loop. However, you can print the tokens and their IDs directly by printing the dictionary, as shown here:

print(gensim_dictionary.token2id)

The output is as follows:

{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45}

The output may not be as clear as the one printed using the loop, although it still serves its purpose.

Let's now see how we can add more tokens to an existing dictionary using a new document. Look at the following script:

text = ["""Colloquially, the term "artificial intelligence" is used to
           describe machines that mimic "cognitive" functions that humans
           associate with other human minds, such as "learning" and "problem solving"""]

# Tokenize the new document and merge its tokens into the existing dictionary
tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary.add_documents(tokens)

print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")
print(gensim_dictionary.token2id)

In the script above we have a new document that contains the second part of the first paragraph of the Wikipedia article on artificial intelligence. We split the text into tokens and then simply call the add_documents method to add the tokens to our existing dictionary. Finally, we print the updated dictionary to the console.

The output of the code looks like this:

The dictionary has: 65 tokens
{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45, '"artificial': 46, '"cognitive"': 47, '"learning"': 48, '"problem': 49, 'Colloquially,': 50, 'associate': 51, 'describe': 52, 'functions': 53, 'human': 54, 'intelligence"': 55, 'machines': 56, 'mimic': 57, 'minds,': 58, 'other': 59, 'solving': 60, 'such': 61, 'term': 62, 'used': 63, 'with': 64}

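You can see that we now have 65 tokens in our dictionary, whereas previously we had 46 tokens. The existing tokens keep their IDs (0 to 45), and the new tokens receive the IDs that follow.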
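The behaviour seen in the output, where old IDs stay fixed and new tokens get the next free IDs, can be imitated with a few lines of plain Python. This is a simplified sketch of the idea and not Gensim's code; add_tokens is a hypothetical helper.

```python
def add_tokens(token2id, tokens):
    """Give each unseen token the next free ID; leave existing IDs untouched."""
    for token in sorted(set(tokens)):
        if token not in token2id:
            token2id[token] = len(token2id)
    return token2id

vocab = add_tokens({}, ["machine", "intelligence"])
before = dict(vocab)                      # snapshot of the first state

add_tokens(vocab, ["machine", "learning"])  # "machine" is already known
print(before)
print(vocab)
```

Because existing IDs never change, anything already encoded with the old dictionary remains valid after new documents are added.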

Creating Dictionaries Using Text Files

In the previous section, we had in-memory text. What if we want to create a dictionary by reading a text file from the hard drive? To do so, we can use the simple_preprocess function from the gensim.utils module together with Python's line-by-line file iteration. The advantage of this approach is that the file is read one line at a time and the tokens of each line are returned as it is read, so you do not need to load the complete text file into memory to create a dictionary.

Before executing the next example, create a file "file1.txt" and add the following text to it (this is the first half of the first paragraph of the Wikipedia article on global warming).

Global warming is a long-term rise in the average temperature of the Earth's climate system, an aspect of climate change shown by temperature measurements and by multiple effects of the warming. Though earlier geological periods also experienced episodes of warming, the term commonly refers to the observed and continuing increase in average air and ocean temperatures since 1900 caused mainly by emissions of greenhouse gasses in the modern industrial economy.

Now let's create a dictionary that contains tokens from the text file "file1.txt":

from gensim import corpora
from gensim.utils import simple_preprocess

# Iterating over the file object yields one line at a time
with open(r'E:\text files\file1.txt', encoding='utf-8') as text_file:
    gensim_dictionary = corpora.Dictionary(simple_preprocess(sentence, deacc=True) for sentence in text_file)

print(gensim_dictionary.token2id)

In the script above, we read the text file "file1.txt" line by line and pass each line to the simple_preprocess function, which returns the tokens in that line. The tokens are then used to create the dictionary. In the output, you should see the tokens and their corresponding IDs as shown below:

{'average': 0, 'climate': 1, 'earth': 2, 'global': 3, 'in': 4, 'is': 5, 'long': 6, 'of': 7, 'rise': 8, 'system': 9, 'temperature': 10, 'term': 11, 'the': 12, 'warming': 13, 'an': 14, 'and': 15, 'aspect': 16, 'by': 17, 'change': 18, 'effects': 19, 'measurements': 20, 'multiple': 21, 'shown': 22, 'also': 23, 'earlier': 24, 'episodes': 25, 'experienced': 26, 'geological': 27, 'periods': 28, 'though': 29, 'air': 30, 'commonly': 31, 'continuing': 32, 'increase': 33, 'observed': 34, 'ocean': 35, 'refers': 36, 'temperatures': 37, 'to': 38, 'caused': 39, 'economy': 40, 'emissions': 41, 'gasses': 42, 'greenhouse': 43, 'industrial': 44, 'mainly': 45, 'modern': 46, 'since': 47}
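The memory advantage of reading line by line comes from lazy iteration: only one line is held in memory at a time. The standard-library sketch below shows the pattern. The regular expression is a rough stand-in for simple_preprocess (lowercase alphabetic runs of two or more letters) and is an assumption made for illustration, as are the function name stream_tokens and the temporary sample file.

```python
import os
import re
import tempfile

def stream_tokens(path):
    """Yield one list of lowercase tokens per line, reading the file lazily."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Rough stand-in for simple_preprocess: alphabetic runs of 2+ letters
            yield re.findall(r"[a-z]{2,}", line.lower())

# Write a small temporary file so the sketch is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Global warming is a long-term rise.\nThe term refers to warming.\n")
    sample_path = tmp.name

lines = list(stream_tokens(sample_path))
os.remove(sample_path)
print(lines)
```

Any consumer that accepts an iterable of token lists can be fed from such a generator without the whole file ever being loaded at once.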

Similarly, we can create a dictionary by reading multiple text files. Create another file called "file2.txt" and add the following text to it (the second part of the first paragraph of the Wikipedia article on global warming):

In the modern context the terms global warming and climate change are commonly used interchangeably, but climate change includes both global warming and its effects, such as changes to precipitation and impacts that differ by region. Many of the observed warming changes since the 1950s are unprecedented in the instrumental temperature record, and in historical and paleoclimate proxy records of climate change over thousands to millions of years.

Save "file2.txt" in the same directory as "file1.txt".

The following script reads both files and then creates a dictionary based on the text in the two files:

import os
from gensim import corpora
from gensim.utils import simple_preprocess

class ReturnTokens(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # Visit the files in sorted order and yield the tokens of every line
        for file_name in sorted(os.listdir(self.dir_path)):
            with open(os.path.join(self.dir_path, file_name), encoding='utf-8') as text_file:
                for sentence in text_file:
                    yield simple_preprocess(sentence)

path_to_text_directory = r"E:\text files"
gensim_dictionary = corpora.Dictionary(ReturnTokens(path_to_text_directory))

print(gensim_dictionary.token2id)

In the script above we have a class, ReturnTokens, which takes the path of the directory containing "file1.txt" and "file2.txt" as its only parameter. Inside its __iter__ method, we iterate over all the files in the directory and read each file line by line. The simple_preprocess function creates tokens for each line, and the tokens of each line are returned to the caller using the yield keyword.

In the output, you should see the following tokens along with their IDs:

{'average': 0, 'climate': 1, 'earth': 2, 'global': 3, 'in': 4, 'is': 5, 'long': 6, 'of': 7, 'rise': 8, 'system': 9, 'temperature': 10, 'term': 11, 'the': 12, 'warming': 13, 'an': 14, 'and': 15, 'aspect': 16, 'by': 17, 'change': 18, 'effects': 19, 'measurements': 20, 'multiple': 21, 'shown': 22, 'also': 23, 'earlier': 24, 'episodes': 25, 'experienced': 26, 'geological': 27, 'periods': 28, 'though': 29, 'air': 30, 'commonly': 31, 'continuing': 32, 'increase': 33, 'observed': 34, 'ocean': 35, 'refers': 36, 'temperatures': 37, 'to': 38, 'caused': 39, 'economy': 40, 'emissions': 41, 'gasses': 42, 'greenhouse': 43, 'industrial': 44, 'mainly': 45, 'modern': 46, 'since': 47, 'are': 48, 'context': 49, 'interchangeably': 50, 'terms': 51, 'used': 52, 'as': 53, 'both': 54, 'but': 55, 'changes': 56, 'includes': 57, 'its': 58, 'precipitation': 59, 'such': 60, 'differ': 61, 'impacts': 62, 'instrumental': 63, 'many': 64, 'record': 65, 'region': 66, 'that': 67, 'unprecedented': 68, 'historical': 69, 'millions': 70, 'over': 71, 'paleoclimate': 72, 'proxy': 73, 'records': 74, 'thousands': 75, 'years': 76}
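One practical property of a class with an __iter__ method, compared with a bare generator, is that it can be traversed more than once. The sketch below contrasts the two using in-memory strings instead of files; the class name Lines and the sample strings are made up for the example.

```python
class Lines:
    """Re-iterable container: every for-loop gets a fresh generator from __iter__."""
    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.lower().split()

docs = Lines(["Global warming", "Climate change"])
first_pass = list(docs)
second_pass = list(docs)   # works again: __iter__ is called anew

gen = (line.lower().split() for line in ["Global warming", "Climate change"])
gen_first = list(gen)
gen_second = list(gen)     # empty: a generator is exhausted after one pass

print(first_pass == second_pass, gen_second)
```

This matters whenever the same stream of documents has to be read several times, for example once to build a dictionary and again to build a corpus.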

Creating Bag of Words Corpus

Dictionaries contain mappings between words and their corresponding numeric values. Bag-of-words corpora in the Gensim library are built on dictionaries and contain the ID of each word along with the number of times the word occurs.

Creating Bag of Words Corpus from In-Memory Objects

Look at the following script:

import gensim
from gensim import corpora
from pprint import pprint

text = ["""In computer science, artificial intelligence (AI),
           sometimes called machine intelligence, is intelligence
           demonstrated by machines, in contrast to the natural intelligence
           displayed by humans and animals. Computer science defines
           AI research as the study of intelligent agents: any device that
           perceives its environment and takes actions that maximize its chance
           of successfully achieving its goals."""]

tokens = [[token for token in sentence.split()] for sentence in text]

# Start with an empty dictionary; doc2bow fills it because allow_update=True
gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

print(gensim_corpus)

In the script above, we have text which we split into tokens. Next, we initialize a Dictionary object from the corpora module. The object has a doc2bow method, which basically performs two tasks:

It iterates over all the words in the document and counts how many times each word occurs, returning the result as (ID, count) pairs.
If a word is not yet in the dictionary, it adds the word and assigns it a new ID (this happens because allow_update=True is passed).

The output of the script above looks like this:

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 3), (26, 1), (27, 1), (28, 1), (29, 3), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 2), (44, 2), (45, 1)]]

The output might not make sense to you at first. Let me explain it. The first tuple (0, 1) means that the word with ID 0 occurred 1 time in the text. Similarly, (25, 3) means that the word with ID 25 occurred three times in the document.

Now let's print the words and their frequency counts to make things clear. Add the following lines of code at the end of the previous script:

word_frequencies = [[(gensim_dictionary[id], frequence) for id, frequence in couple] for couple in gensim_corpus]
print(word_frequencies)

The output looks like this:

[[('(AI),', 1), ('AI', 1), ('Computer', 1), ('In', 1), ('achieving', 1), ('actions', 1), ('agents:', 1), ('and', 2), ('animals.', 1), ('any', 1), ('artificial', 1), ('as', 1), ('by', 2), ('called', 1), ('chance', 1), ('computer', 1), ('contrast', 1), ('defines', 1), ('demonstrated', 1), ('device', 1), ('displayed', 1), ('environment', 1), ('goals.', 1), ('humans', 1), ('in', 1), ('intelligence', 3), ('intelligence,', 1), ('intelligent', 1), ('is', 1), ('its', 3), ('machine', 1), ('machines,', 1), ('maximize', 1), ('natural', 1), ('of', 2), ('perceives', 1), ('research', 1), ('science', 1), ('science,', 1), ('sometimes', 1), ('study', 1), ('successfully', 1), ('takes', 1), ('that', 2), ('the', 2), ('to', 1)]]

From the output, you can see that the word "intelligence" appears three times. Similarly, the word "that" appears twice.
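To make the counting concrete, here is a plain-Python sketch of what a bag-of-words conversion does, using collections.Counter. It mirrors the two tasks described for doc2bow but is not Gensim's implementation; to_bow and the sample phrase are hypothetical.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, adding unseen tokens to token2id."""
    for token in sorted(set(tokens)):
        # setdefault only assigns a new ID when the token is not yet known
        token2id.setdefault(token, len(token2id))
    counts = Counter(token2id[token] for token in tokens)
    return sorted(counts.items())

vocab = {}
bow = to_bow("its goals and its chance".split(), vocab)

print(vocab)
print(bow)
```

Word order is discarded in this representation; only the IDs and their counts survive, which is where the name "bag of words" comes from.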

Creating Bag of Words Corpus from Text Files

As with dictionaries, we can also create a bag-of-words corpus by reading a text file. Look at the following code:

from gensim import corpora
from gensim.utils import simple_preprocess

# Tokenize the file one line at a time; each line becomes one document
with open(r'E:\text files\file1.txt', encoding='utf-8') as text_file:
    tokens = [simple_preprocess(sentence, deacc=True) for sentence in text_file]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]
word_frequencies = [[(gensim_dictionary[id], frequence) for id, frequence in couple] for couple in gensim_corpus]

print(word_frequencies)

In the script above, we created a bag-of-words corpus using file1.txt. In the output, you should see the words from the first paragraph of the Wikipedia article on global warming.

[[('average', 1), ('climate', 1), ('earth', 1), ('global', 1), ('in', 1), ('is', 1), ('long', 1), ('of', 1), ('rise', 1), ('system', 1), ('temperature', 1), ('term', 1), ('the', 2), ('warming', 1)], [('climate', 1), ('of', 2), ('temperature', 1), ('the', 1), ('warming', 1), ('an', 1), ('and', 1), ('aspect', 1), ('by', 2), ('change', 1), ('effects', 1), ('measurements', 1), ('multiple', 1), ('shown', 1)], [('of', 1), ('warming', 1), ('also', 1), ('earlier', 1), ('episodes', 1), ('experienced', 1), ('geological', 1), ('periods', 1), ('though', 1)], [('average', 1), ('in', 1), ('term', 1), ('the', 2), ('and', 2), ('air', 1), ('commonly', 1), ('continuing', 1), ('increase', 1), ('observed', 1), ('ocean', 1), ('refers', 1), ('temperatures', 1), ('to', 1)], [('in', 1), ('of', 1), ('the', 1), ('by', 1), ('caused', 1), ('economy', 1), ('emissions', 1), ('gasses', 1), ('greenhouse', 1), ('industrial', 1), ('mainly', 1), ('modern', 1), ('since', 1)]]

The output shows that words such as "of", "the", "by", and "and" occur twice within some lines.

Similarly, you can create a bag-of-words corpus using multiple text files, as shown below:

import os
from gensim import corpora
from gensim.utils import simple_preprocess

class ReturnTokens(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # Visit the files in sorted order and yield the tokens of every line
        for file_name in sorted(os.listdir(self.dir_path)):
            with open(os.path.join(self.dir_path, file_name), encoding='utf-8') as text_file:
                for sentence in text_file:
                    yield simple_preprocess(sentence)

path_to_text_directory = r"E:\text files"

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in ReturnTokens(path_to_text_directory)]
word_frequencies = [[(gensim_dictionary[id], frequence) for id, frequence in couple] for couple in gensim_corpus]

print(word_frequencies)

The output of the script above looks like this:

[[('average', 1), ('climate', 1), ('earth', 1), ('global', 1), ('in', 1), ('is', 1), ('long', 1), ('of', 1), ('rise', 1), ('system', 1), ('temperature', 1), ('term', 1), ('the', 2), ('warming', 1)], [('climate', 1), ('of', 2), ('temperature', 1), ('the', 1), ('warming', 1), ('an', 1), ('and', 1), ('aspect', 1), ('by', 2), ('change', 1), ('effects', 1), ('measurements', 1), ('multiple', 1), ('shown', 1)], [('of', 1), ('warming', 1), ('also', 1), ('earlier', 1), ('episodes', 1), ('experienced', 1), ('geological', 1), ('periods', 1), ('though', 1)], [('average', 1), ('in', 1), ('term', 1), ('the', 2), ('and', 2), ('air', 1), ('commonly', 1), ('continuing', 1), ('increase', 1), ('observed', 1), ('ocean', 1), ('refers', 1), ('temperatures', 1), ('to', 1)], [('in', 1), ('of', 1), ('the', 1), ('by', 1), ('caused', 1), ('economy', 1), ('emissions', 1), ('gasses', 1), ('greenhouse', 1), ('industrial', 1), ('mainly', 1), ('modern', 1), ('since', 1)], [('climate', 1), ('global', 1), ('in', 1), ('the', 2), ('warming', 1), ('and', 1), ('change', 1), ('commonly', 1), ('modern', 1), ('are', 1), ('context', 1), ('interchangeably', 1), ('terms', 1), ('used', 1)], [('climate', 1), ('global', 1), ('warming', 1), ('and', 2), ('change', 1), ('effects', 1), ('to', 1), ('as', 1), ('both', 1), ('but', 1), ('changes', 1), ('includes', 1), ('its', 1), ('precipitation', 1), ('such', 1)], [('in', 1), ('of', 1), ('temperature', 1), ('the', 3), ('warming', 1), ('by', 1), ('observed', 1), ('since', 1), ('are', 1), ('changes', 1), ('differ', 1), ('impacts', 1), ('instrumental', 1), ('many', 1), ('record', 1), ('region', 1), ('that', 1), ('unprecedented', 1)], [('climate', 1), ('in', 1), ('of', 2), ('and', 2), ('change', 1), ('to', 1), ('historical', 1), ('millions', 1), ('over', 1), ('paleoclimate', 1), ('proxy', 1), ('records', 1), ('thousands', 1), ('years', 1)]]

Creating TF-IDF Corpus

The bag-of-words approach works fine for converting text to numbers. However, it has one drawback: it assigns a score to a word based only on its occurrence in a particular document. It does not take into account the fact that the word might also occur frequently in other documents. TF-IDF resolves this issue.

The term frequency is calculated as:

Term frequency = (Frequency of the word in a document)/(Total words in the document)

And the inverse document frequency is calculated as:

IDF(word) = Log((Total number of documents)/(Number of documents containing the word))
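The two formulas can be evaluated by hand in a few lines of Python. The sketch below is a direct transcription of them; the natural logarithm is an assumption (the formula does not state a base), and the three short sentences and function names are chosen for illustration.

```python
import math

docs = [
    "I like to play Football".split(),
    "Football is the best game".split(),
    "Which game do you like to play ?".split(),
]

def term_frequency(word, doc):
    """(Frequency of the word in a document) / (Total words in the document)."""
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    """Log((Total number of documents) / (Number of documents containing the word))."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)  # natural log: an assumption

def tf_idf(word, doc, docs):
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)

score_i = tf_idf("I", docs[0], docs)                # "I" occurs in one document only
score_football = tf_idf("Football", docs[0], docs)  # "Football" occurs in two documents
print(score_i > score_football)
```

Both words occur once in the first sentence, so their term frequencies are equal; the rarer word scores higher purely because of the IDF factor. Gensim's TfidfModel applies its own weighting and normalization, so its numbers need not match this sketch.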

Using the Gensim library, we can easily create a TF-IDF corpus:

import gensim
from gensim import corpora
from pprint import pprint

text = ["I like to play Football",
       "Football is the best game",
       "Which game do you like to play ?"]

tokens = [[token for token in sentence.split()] for sentence in text]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

from gensim import models
import numpy as np

# Fit the TF-IDF model on the bag-of-words corpus
tfidf = models.TfidfModel(gensim_corpus, smartirs='ntc')

# Indexing the model with a corpus applies the transformation to each document
for sent in tfidf[gensim_corpus]:
    print([(gensim_dictionary[id], float(np.around(frequency, decimals=2))) for id, frequency in sent])

To find the TF-IDF values, we can use the TfidfModel class from the models module of the Gensim library. We simply have to pass the bag-of-words corpus as a parameter to the constructor of the TfidfModel class. In the output, you will see all of the words in the three sentences, along with their TF-IDF values:

[('Football', 0.3), ('I', 0.8), ('like', 0.3), ('play', 0.3), ('to', 0.3)]
[('Football', 0.2), ('best', 0.55), ('game', 0.2), ('is', 0.55), ('the', 0.55)]
[('like', 0.17), ('play', 0.17), ('to', 0.17), ('game', 0.17), ('?', 0.47), ('Which', 0.47), ('do', 0.47), ('you', 0.47)]

Downloading Built-In Gensim Models and Datasets

Gensim comes with a variety of built-in datasets and word embedding models that can be used directly.

To download a built-in model or dataset, we can use the downloader module from the gensim library. We can then call its load function to download the desired package. Look at the following code:

import gensim.downloader as api

w2v_embedding = api.load("glove-wiki-gigaword-100")

With the commands above, we download the "glove-wiki-gigaword-100" word embedding model, which is based on Wikipedia text and is 100-dimensional. Let's try to find the words similar to "toyota" using our word embedding model. Use the following code to do so:

w2v_embedding.most_similar('toyota')

In the output, you should see the following results:

[('honda', 0.8739858865737915),
 ('nissan', 0.8108116984367371),
 ('automaker', 0.7918163537979126),
 ('mazda', 0.7687169313430786),
 ('bmw', 0.7616022825241089),
 ('ford', 0.7547588348388672),
 ('motors', 0.7539199590682983),
 ('volkswagen', 0.7176680564880371),
 ('prius', 0.7156582474708557),
 ('chrysler', 0.7085398435592651)]

You can see that all the results are very relevant to the word "toyota". The decimal number in each pair is the similarity score; a higher score means the word is more closely related.
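The similarity score returned by most_similar is a cosine similarity between word vectors. The sketch below shows how such a ranking can be computed in plain Python. The three-dimensional vectors are made up for illustration (the real GloVe vectors have 100 dimensions), and the local most_similar function is a simplified stand-in, not Gensim's method.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, invented for this example
vectors = {
    "toyota": [0.9, 0.8, 0.1],
    "honda": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

def most_similar(word, vectors):
    """Rank every other word by its cosine similarity to the given word."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("toyota", vectors)
print(ranking)
```

Vectors pointing in nearly the same direction score close to 1, which is why related car brands rank at the top of the real output.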

Conclusion

The Gensim library is one of the most popular Python libraries for NLP. In this article, we briefly explored how the Gensim library can be used to perform tasks such as creating dictionaries and corpora. We also saw how to download built-in Gensim models and datasets. In the next article, we will see how to perform topic modeling with the Gensim library.


Published on 1403-01-23 02:48:05
