CountVectorizer to Extract Features from Texts in Python, in Detail

admin

October 21, 2023

Everything you need to know to use CountVectorizer efficiently in scikit-learn

The most basic data processing step that any Natural Language Processing (NLP) project requires is converting the text data to numeric data. As long as the data is in text form, we cannot do any kind of computation on it.

There are multiple methods available for this text-to-numeric conversion. This tutorial explains one of the most basic vectorizers, the CountVectorizer class in the scikit-learn library.

This method is very simple. It takes the frequency of occurrence of each word as the numeric value. An example will make it clear.

In the following code block, we will:

Import the CountVectorizer class.
Create an instance of it.
Fit it to the text data and convert the result to an array.

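import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# This is the text to be vectorized (one string, so one document)
text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt. "
        "I am trying to learn how to use count vectorizer."]

# Create the vectorizer, learn the vocabulary, and count each word
cv = CountVectorizer()
count_matrix = cv.fit_transform(text)

# fit_transform returns a sparse matrix; convert it to a dense NumPy array
cnt_arr = count_matrix.toarray()
cnt_arr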

Output:

array([[1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1]],
dtype=int64)

Here we have the numeric values representing the text data above.

How do we know which values represent which words in the text?

To make that clear, it helps to convert the array into a DataFrame whose column names are the words themselves.

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
cnt_df = pd.DataFrame(data=cnt_arr, columns=cv.get_feature_names_out())
cnt_df

Now it shows clearly. The value for the word ‘also’ is 1, which means ‘also’ appeared just once in the text. The word ‘aunt’ appeared twice in the text, so the value for ‘aunt’ is 2.
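To see where these numbers come from, the counting can be approximated in plain Python. This is only a sketch of the behavior under CountVectorizer's default settings (lowercasing, tokens of two or more word characters, alphabetically sorted columns), not the library's actual implementation. The names `tokens`, `counts`, `vocabulary`, and `row` are illustrative.

```python
import re
from collections import Counter

text = ("Hello Everyone! This is Lilly. My aunt's name is also Lilly. "
        "I love my aunt. I am trying to learn how to use count vectorizer.")

# Approximate the default tokenization: lowercase the text, then keep runs
# of two or more word characters. Single-character tokens such as "I" and
# the "s" in "aunt's" are dropped.
tokens = re.findall(r"\b\w\w+\b", text.lower())

# Count how often each token occurs.
counts = Counter(tokens)

# Columns follow the alphabetically sorted vocabulary.
vocabulary = sorted(counts)
row = [counts[word] for word in vocabulary]

print(vocabulary)
print(row)
```

The printed row should match the array shown earlier, and `counts["aunt"]` gives the 2 seen in the DataFrame.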

In the last example, all the sentences were in a single string, so we got just one row of data for 4 sentences. Let’s rearrange the text and…
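A minimal sketch of this one-row-per-string behavior. The way the text is split into four strings below is an assumption made for illustration, not necessarily the arrangement the tutorial goes on to use, and the names `single` and `split` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The same text as one string: the vectorizer sees a single document.
single = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. "
          "I love my aunt. I am trying to learn how to use count vectorizer."]

# Hypothetical split of the same text into four strings (illustration only).
split = [
    "Hello Everyone! This is Lilly.",
    "My aunt's name is also Lilly.",
    "I love my aunt.",
    "I am trying to learn how to use count vectorizer.",
]

# The shape is (number of strings, number of distinct words).
single_shape = CountVectorizer().fit_transform(single).shape
split_shape = CountVectorizer().fit_transform(split).shape

print(single_shape)  # one row for the single string
print(split_shape)   # one row per string
```

Each string in the input list is treated as one document, so the number of rows follows the length of the list while the columns still cover the whole vocabulary.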
