text_processing_util_mds24
Submodules
Package Contents
Functions
text_clean(docs) | Removes punctuation and numbers, converts each document to lowercase, and splits it into a list of words.
frequency_vectorizer(docs) | Calculates the frequency of each word in a list of text documents.
tfidf_vectorizer(docs) | Calculates TF-IDF scores for a list of documents. The TF-IDF score measures the importance of a word to its document, adjusted for the word's overall frequency in all documents.
tokenizer_padding(docs) | Converts each text document into a sequence of integer tokens (one numerical identifier per word) and pads shorter sequences so that all tokenized documents have the same length. The result can then be passed to deep learning libraries, for example to build recurrent neural networks.
Attributes
- text_processing_util_mds24.__version__
- text_processing_util_mds24.text_clean(docs: list[str]) → list[list[str]]
Removes punctuation and numbers, converts each document to lowercase, and splits it into a list of words.
- Parameters:
docs (list[str]) – Documents to be processed. Each item in the list is a document.
- Returns:
Cleaned documents.
- Return type:
list[list[str]]
Examples
>>> text_clean(["We are group 10.", "We are the best!"])
[["we", "are", "group"], ["we", "are", "the", "best"]]
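The cleaning steps described above can be sketched in a few lines. This is an illustrative re-implementation, not the package's source: the name text_clean_sketch is hypothetical, and treating "punctuation" and "numbers" as the ASCII sets in string.punctuation and string.digits is an assumption.

```python
import string


def text_clean_sketch(docs: list[str]) -> list[list[str]]:
    """Hypothetical stand-in for text_clean; not the package's implementation."""
    # Translation table that deletes ASCII punctuation and digits (assumed sets).
    drop = str.maketrans("", "", string.punctuation + string.digits)
    cleaned = []
    for doc in docs:
        # Lowercase, strip unwanted characters, then split on whitespace.
        cleaned.append(doc.lower().translate(drop).split())
    return cleaned


print(text_clean_sketch(["We are group 10.", "We are the best!"]))
# [['we', 'are', 'group'], ['we', 'are', 'the', 'best']]
```

Deleting digits before splitting means a token made only of digits, such as "10", disappears entirely, which agrees with the documented example.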
- text_processing_util_mds24.frequency_vectorizer(docs: list[str]) → tuple[numpy.ndarray, numpy.ndarray]
Calculates the relative frequency of each word in a list of text documents, that is, each word's count in a document divided by the total number of words in that document.
- Parameters:
docs (list[str]) – A list of text documents.
- Returns:
- Tuple containing two elements:
  1. A 2D array containing frequency scores for each term in each document.
  2. An array of feature names corresponding to the columns in the frequency matrix.
- Return type:
tuple[np.ndarray, np.ndarray]
Examples
>>> docs = ["This is a sample document.", "Another document for testing."]
>>> result_tf_matrix, result_feature_names = frequency_vectorizer(docs)
>>> print("Frequency Matrix:")
Frequency Matrix:
>>> print(result_tf_matrix)
[[0.2  0.   0.2  0.   0.2  0.2  0.   0.2 ]
 [0.   0.25 0.25 0.25 0.   0.   0.25 0.  ]]
>>> print("Feature Names:")
Feature Names:
>>> print(result_feature_names)
['a', 'another', 'document', 'for', 'is', 'sample', 'testing', 'this']
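The numbers in this example are consistent with dividing each word's count by the number of words in its document. A minimal sketch under that assumption follows; frequency_vectorizer_sketch is a hypothetical name, and the regex tokenization and alphabetical column order are inferred from the example output rather than taken from the package's code.

```python
import re

import numpy as np


def frequency_vectorizer_sketch(docs: list[str]) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical stand-in for frequency_vectorizer; not the package's code."""
    # Assumed tokenization: lowercase, keep runs of ASCII letters only.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    # Columns follow the alphabetically sorted vocabulary.
    vocab = sorted({word for words in tokenized for word in words})
    column = {word: j for j, word in enumerate(vocab)}

    matrix = np.zeros((len(docs), len(vocab)))
    for i, words in enumerate(tokenized):
        for word in words:
            matrix[i, column[word]] += 1
        if words:
            # Turn raw counts into relative frequencies for this document.
            matrix[i] /= len(words)
    return matrix, np.array(vocab)


tf_matrix, names = frequency_vectorizer_sketch(
    ["This is a sample document.", "Another document for testing."]
)
print(tf_matrix)
print(names)
```

With the two documents above, every row sums to 1, and "document" is the only column that is non-zero in both rows.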
- text_processing_util_mds24.tfidf_vectorizer(docs: list[str]) → tuple[numpy.ndarray, numpy.ndarray]
Calculates TF-IDF scores for a list of documents. The TF-IDF score measures the importance of a word to its document, adjusted for the word’s overall frequency in all documents.
- Parameters:
docs (list[str]) – A list of documents (strings).
- Returns:
- Tuple containing two elements:
  1. A 2D array containing TF-IDF scores for each term in each document.
  2. An array of feature names corresponding to the columns in the TF-IDF matrix.
- Return type:
tuple[np.ndarray, np.ndarray]
Examples
>>> docs = ["Machine learning is interesting", "Python is widely used in machine learning"]
>>> tfidf_matrix, feature_names = tfidf_vectorizer(docs)
>>> print("TFIDF Matrix:")
TFIDF Matrix:
>>> print(tfidf_matrix)
[[0.        , 0.43550663, 0.43550663, 0.43550663, 0.43550663, 0.43550663]
 [0.57735027, 0.        , 0.        , 0.        , 0.        , 0.        ]]
>>> print("Feature Names:")
Feature Names:
>>> print(feature_names)
['in', 'interesting', 'is', 'learning', 'machine', 'python']
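To make the idea of "importance adjusted for overall frequency" concrete, here is a sketch of the textbook formulation: term frequency (count divided by document length) multiplied by inverse document frequency, ln(N / df). This page does not state which smoothing or normalization the package applies, so the sketch is not expected to reproduce the numbers in the example above; tfidf_sketch and its tokenization are assumptions made for illustration.

```python
import re

import numpy as np


def tfidf_sketch(docs: list[str]) -> tuple[np.ndarray, np.ndarray]:
    """Textbook TF-IDF (tf * ln(N / df)); not the package's exact weighting."""
    # Assumed tokenization: lowercase, keep runs of ASCII letters only.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    vocab = sorted({word for words in tokenized for word in words})
    column = {word: j for j, word in enumerate(vocab)}

    # Document frequency: number of documents that contain each word.
    df = np.zeros(len(vocab))
    for words in tokenized:
        for word in set(words):
            df[column[word]] += 1
    # Every vocabulary word occurs in at least one document, so df > 0.
    idf = np.log(len(docs) / df)

    # Term frequency: count of each word divided by the document's length.
    tf = np.zeros((len(docs), len(vocab)))
    for i, words in enumerate(tokenized):
        for word in words:
            tf[i, column[word]] += 1
        if words:
            tf[i] /= len(words)

    return tf * idf, np.array(vocab)


scores, names = tfidf_sketch(
    ["Machine learning is interesting", "Python is widely used in machine learning"]
)
print(names)
print(scores.round(4))
```

In this formulation a word that occurs in every document, such as "is" here, gets an IDF of ln(1) = 0 and therefore a score of zero, while a word unique to one document keeps a positive score.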
- text_processing_util_mds24.tokenizer_padding(docs: list[str]) → numpy.ndarray
Converts each text document into a sequence of integer tokens (one numerical identifier per word) and pads shorter sequences so that all tokenized documents have the same length. The resulting array can then be passed to deep learning libraries, for example to build recurrent neural networks.
- Parameters:
docs (list[str]) – A list of text documents.
- Returns:
2D array of tokenized and padded sequences of the input documents.
- Return type:
np.ndarray
Examples
>>> tokenized_padded = tokenizer_padding(["the first sentence", "the second longer sentence"])
>>> print(tokenized_padded)
[[1, 2, 3, 0], [1, 4, 5, 3]]
>>> tokenized_padded = tokenizer_padding(["a sample text", "sample text two"])
>>> print(tokenized_padded)
[[1, 2, 3], [2, 3, 4]]
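Both examples are consistent with numbering words from 1 in order of first appearance and filling the end of shorter sequences with 0. The sketch below follows that reading; tokenizer_padding_sketch is a hypothetical name, and the indexing order and right-side padding are inferred from the examples, not confirmed from the package's source.

```python
import numpy as np


def tokenizer_padding_sketch(docs: list[str]) -> np.ndarray:
    """Hypothetical stand-in for tokenizer_padding; not the package's code."""
    vocab: dict[str, int] = {}
    sequences = []
    for doc in docs:
        sequence = []
        for word in doc.lower().split():
            if word not in vocab:
                # Ids start at 1 because 0 is reserved for padding (assumed).
                vocab[word] = len(vocab) + 1
            sequence.append(vocab[word])
        sequences.append(sequence)

    # Pad on the right with zeros up to the longest sequence.
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = np.zeros((len(sequences), max_len), dtype=int)
    for i, seq in enumerate(sequences):
        padded[i, : len(seq)] = seq
    return padded


print(tokenizer_padding_sketch(["the first sentence", "the second longer sentence"]))
# [[1 2 3 0]
#  [1 4 5 3]]
```

Reserving 0 for padding keeps padded positions distinguishable from real words, which is what lets downstream sequence models mask them out.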