Example usage

This package, text_processing_util_mds24, provides four functions for processing and representing text data for machine learning tasks, specifically natural language processing. Three of them build text representations from a list of raw-text documents: frequency_vectorizer, tfidf_vectorizer and tokenizer_padding. For users who wish to represent text in another way, text_clean converts all characters to lower case, removes punctuation and numbers, and splits each document into a list of words. This page shows examples of how to use these functions.

Imports

from text_processing_util_mds24 import (
    text_clean,
    frequency_vectorizer,
    tfidf_vectorizer,
    tokenizer_padding
)

Creating Text Documents

We will first create a sample list of documents using the first paragraph of On the Origin of Species by Charles Darwin. (Note that this book is in the public domain.) The paragraph is stored in the file origin_of_species.txt, one sentence per line. Each line is read as an individual document.

with open("origin_of_species.txt", encoding="utf-8") as text_data_file:
    origin_of_species = [line.rstrip() for line in text_data_file]

origin_of_species

['When on board H.M.S. Beagle, as naturalist, I was much struck with certain facts in the distribution of the organic beings inhabiting South America, and in the geological relations of the present to the past inhabitants of that continent.',
 'These facts, as will be seen in the latter chapters of this volume, seemed to throw some light on the origin of species—that mystery of mysteries, as it has been called by one of our greatest philosophers.',
 'On my return home, it occurred to me, in 1837, that something might perhaps be made out on this question by patiently accumulating and reflecting on all sorts of facts which could possibly have any bearing on it.',
 'After five years’ work I allowed myself to speculate on the subject, and drew up some short notes; these I enlarged in 1844 into a sketch of the conclusions, which then seemed to me probable: from that period to the present day I have steadily pursued the same object.',
 'I hope that I may be excused for entering on these personal details, as I give them to show that I have not been hasty in coming to a decision.']

Cleaning the Text

text_clean() cleans raw text for further processing. It converts all characters to lower case, removes punctuation and numbers, and splits each document into a list of words on whitespace. All other functions in this package call text_clean() before transforming the text into another representation, so they accept raw text as input. You can also use this function on its own to clean text before passing it to another algorithm of your choice. Note that in the output below some non-ASCII punctuation, such as the curly apostrophe in years’, is retained.

The usage of this function is demonstrated below.

cleaned_text = text_clean(origin_of_species)

print("Cleaned documents:")
for doc in cleaned_text:
    print(doc)

Cleaned documents:
['when', 'on', 'board', 'hms', 'beagle', 'as', 'naturalist', 'i', 'was', 'much', 'struck', 'with', 'certain', 'facts', 'in', 'the', 'distribution', 'of', 'the', 'organic', 'beings', 'inhabiting', 'south', 'america', 'and', 'in', 'the', 'geological', 'relations', 'of', 'the', 'present', 'to', 'the', 'past', 'inhabitants', 'of', 'that', 'continent']
['these', 'facts', 'as', 'will', 'be', 'seen', 'in', 'the', 'latter', 'chapters', 'of', 'this', 'volume', 'seemed', 'to', 'throw', 'some', 'light', 'on', 'the', 'origin', 'of', 'species—that', 'mystery', 'of', 'mysteries', 'as', 'it', 'has', 'been', 'called', 'by', 'one', 'of', 'our', 'greatest', 'philosophers']
['on', 'my', 'return', 'home', 'it', 'occurred', 'to', 'me', 'in', 'that', 'something', 'might', 'perhaps', 'be', 'made', 'out', 'on', 'this', 'question', 'by', 'patiently', 'accumulating', 'and', 'reflecting', 'on', 'all', 'sorts', 'of', 'facts', 'which', 'could', 'possibly', 'have', 'any', 'bearing', 'on', 'it']
['after', 'five', 'years’', 'work', 'i', 'allowed', 'myself', 'to', 'speculate', 'on', 'the', 'subject', 'and', 'drew', 'up', 'some', 'short', 'notes', 'these', 'i', 'enlarged', 'in', 'into', 'a', 'sketch', 'of', 'the', 'conclusions', 'which', 'then', 'seemed', 'to', 'me', 'probable', 'from', 'that', 'period', 'to', 'the', 'present', 'day', 'i', 'have', 'steadily', 'pursued', 'the', 'same', 'object']
['i', 'hope', 'that', 'i', 'may', 'be', 'excused', 'for', 'entering', 'on', 'these', 'personal', 'details', 'as', 'i', 'give', 'them', 'to', 'show', 'that', 'i', 'have', 'not', 'been', 'hasty', 'in', 'coming', 'to', 'a', 'decision']
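
For readers who want to see the cleaning steps spelled out, the following is a minimal sketch of the behaviour described above. It is not the package's implementation: the name simple_text_clean is hypothetical, and the choice to strip only ASCII punctuation and digits is an assumption inferred from the printed output.

```python
import string

# Assumption: only ASCII punctuation and digits are removed, which is
# consistent with the printed output above (curly quotes are kept).
_DELETE_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def simple_text_clean(documents):
    """Lower-case each document, delete punctuation and digits, split on whitespace."""
    return [doc.lower().translate(_DELETE_TABLE).split() for doc in documents]


sample = ["On my return home, it occurred to me, in 1837, that..."]
print(simple_text_clean(sample))
```

Because str.split() with no arguments collapses runs of whitespace, the gap left by a removed number such as 1837 does not produce an empty token.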

In addition to cleaning the text, the package provides three different text representations to be used for machine learning models: frequency vectorizer, TF-IDF vectorizer and tokenizer plus padding.

Text Representation 1: Frequency Vectorizer

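The frequency_vectorizer function calculates the relative frequency of each word in each document of a list of text documents. In the output below, each entry is a word's count divided by the number of words in that document (for example, 1/39 ≈ 0.0256 for a word that occurs once in the 39-word first document). The function returns a feature matrix (word frequency matrix) with one row per document and one column per word, together with the list of feature names, for use in machine learning.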

The usage of this function is demonstrated below.

freq_matrix, freq_feature_names = frequency_vectorizer(origin_of_species)

print("Frequency matrix:")
print(freq_matrix)
print("\nFeature names:")
print(freq_feature_names)

Frequency matrix:
[[0.         0.         0.         0.         0.         0.02564103
  0.02564103 0.         0.02564103 0.         0.02564103 0.
  0.         0.02564103 0.02564103 0.         0.         0.02564103
  0.         0.         0.         0.02564103 0.         0.
  0.         0.         0.02564103 0.         0.         0.
  0.         0.02564103 0.         0.         0.         0.02564103
  0.         0.         0.         0.         0.         0.02564103
  0.         0.         0.02564103 0.05128205 0.02564103 0.02564103
  0.         0.         0.         0.         0.         0.
  0.         0.         0.02564103 0.         0.         0.
  0.         0.02564103 0.         0.         0.         0.
  0.07692308 0.02564103 0.         0.02564103 0.         0.
  0.         0.02564103 0.         0.         0.         0.
  0.         0.         0.02564103 0.         0.         0.
  0.         0.02564103 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.02564103 0.         0.         0.         0.02564103 0.
  0.02564103 0.12820513 0.         0.         0.         0.
  0.         0.02564103 0.         0.         0.02564103 0.02564103
  0.         0.         0.02564103 0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.05405405 0.02702703 0.         0.
  0.02702703 0.         0.         0.02702703 0.02702703 0.
  0.02702703 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.02702703 0.         0.         0.         0.
  0.         0.02702703 0.02702703 0.         0.         0.
  0.         0.         0.         0.02702703 0.         0.
  0.         0.02702703 0.02702703 0.02702703 0.         0.
  0.         0.         0.         0.         0.         0.02702703
  0.02702703 0.         0.         0.         0.         0.
  0.10810811 0.02702703 0.02702703 0.         0.02702703 0.02702703
  0.         0.         0.         0.         0.         0.
  0.02702703 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.02702703 0.02702703
  0.         0.         0.         0.02702703 0.         0.
  0.         0.02702703 0.         0.         0.         0.
  0.         0.05405405 0.         0.         0.02702703 0.02702703
  0.02702703 0.02702703 0.         0.02702703 0.         0.
  0.         0.02702703 0.         0.         0.        ]
 [0.         0.02702703 0.         0.02702703 0.         0.
  0.02702703 0.02702703 0.         0.02702703 0.         0.02702703
  0.         0.         0.         0.02702703 0.         0.
  0.         0.         0.         0.         0.02702703 0.
  0.         0.         0.         0.         0.         0.
  0.         0.02702703 0.         0.         0.         0.
  0.         0.         0.         0.         0.02702703 0.
  0.02702703 0.         0.         0.02702703 0.         0.
  0.         0.05405405 0.         0.         0.02702703 0.
  0.02702703 0.02702703 0.         0.02702703 0.         0.
  0.         0.         0.         0.         0.         0.02702703
  0.02702703 0.10810811 0.         0.         0.         0.
  0.02702703 0.         0.02702703 0.02702703 0.         0.
  0.         0.02702703 0.         0.         0.         0.02702703
  0.02702703 0.         0.02702703 0.         0.         0.
  0.         0.         0.         0.         0.02702703 0.02702703
  0.         0.         0.         0.         0.         0.
  0.02702703 0.         0.         0.         0.         0.02702703
  0.         0.02702703 0.         0.         0.         0.
  0.02702703 0.         0.         0.         0.        ]
 [0.02083333 0.         0.02083333 0.         0.02083333 0.
  0.02083333 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.02083333 0.         0.         0.02083333
  0.         0.         0.         0.02083333 0.02083333 0.
  0.         0.         0.02083333 0.         0.02083333 0.
  0.         0.         0.         0.         0.02083333 0.
  0.         0.         0.0625     0.02083333 0.         0.
  0.02083333 0.         0.         0.         0.         0.
  0.02083333 0.         0.         0.         0.02083333 0.
  0.         0.         0.         0.02083333 0.02083333 0.
  0.02083333 0.02083333 0.         0.         0.         0.
  0.         0.         0.         0.         0.02083333 0.
  0.         0.         0.02083333 0.02083333 0.02083333 0.
  0.         0.         0.         0.02083333 0.02083333 0.
  0.02083333 0.         0.02083333 0.02083333 0.         0.
  0.         0.         0.02083333 0.02083333 0.         0.02083333
  0.02083333 0.08333333 0.         0.02083333 0.02083333 0.
  0.         0.0625     0.02083333 0.         0.         0.
  0.02083333 0.         0.         0.02083333 0.02083333]
 [0.03333333 0.         0.         0.         0.         0.
  0.         0.         0.03333333 0.03333333 0.         0.
  0.03333333 0.         0.         0.         0.         0.
  0.         0.03333333 0.         0.         0.         0.
  0.03333333 0.03333333 0.         0.         0.         0.03333333
  0.03333333 0.         0.         0.03333333 0.         0.
  0.03333333 0.         0.         0.03333333 0.03333333 0.
  0.         0.03333333 0.13333333 0.03333333 0.         0.
  0.         0.         0.         0.         0.         0.03333333
  0.         0.         0.         0.         0.         0.
  0.         0.         0.03333333 0.         0.         0.
  0.         0.03333333 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.03333333
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.03333333 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.06666667 0.         0.03333333 0.         0.03333333 0.
  0.         0.06666667 0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]]

Feature names:
['a', 'accumulating', 'after', 'all', 'allowed', 'america', 'and', 'any', 'as', 'be', 'beagle', 'bearing', 'been', 'beings', 'board', 'by', 'called', 'certain', 'chapters', 'coming', 'conclusions', 'continent', 'could', 'day', 'decision', 'details', 'distribution', 'drew', 'enlarged', 'entering', 'excused', 'facts', 'five', 'for', 'from', 'geological', 'give', 'greatest', 'has', 'hasty', 'have', 'hms', 'home', 'hope', 'i', 'in', 'inhabitants', 'inhabiting', 'into', 'it', 'latter', 'light', 'made', 'may', 'me', 'might', 'much', 'my', 'myself', 'mysteries', 'mystery', 'naturalist', 'not', 'notes', 'object', 'occurred', 'of', 'on', 'one', 'organic', 'origin', 'our', 'out', 'past', 'patiently', 'perhaps', 'period', 'personal', 'philosophers', 'possibly', 'present', 'probable', 'pursued', 'question', 'reflecting', 'relations', 'return', 'same', 'seemed', 'seen', 'short', 'show', 'sketch', 'some', 'something', 'sorts', 'south', 'species—that', 'speculate', 'steadily', 'struck', 'subject', 'that', 'the', 'them', 'then', 'these', 'this', 'throw', 'to', 'up', 'volume', 'was', 'when', 'which', 'will', 'with', 'work', 'years’']
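
The numbers above can be reproduced by hand: each entry is a word count divided by the length of its document, with columns in sorted vocabulary order. The following is a minimal sketch under that reading. The function name simple_frequency_vectorizer is hypothetical, it takes already-tokenized documents, and it is not the package's actual code.

```python
from collections import Counter


def simple_frequency_vectorizer(tokenized_docs):
    """Return (matrix, vocabulary) of relative word frequencies per document."""
    # Columns follow the sorted vocabulary, as in the feature names above.
    vocabulary = sorted({word for doc in tokenized_docs for word in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        # Guard against empty documents to avoid division by zero.
        row = [counts[word] / total if total else 0.0 for word in vocabulary]
        matrix.append(row)
    return matrix, vocabulary


docs = [["the", "origin", "of", "the", "species"], ["the", "facts"]]
matrix, vocabulary = simple_frequency_vectorizer(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each non-empty row sums to 1, so documents of different lengths remain comparable.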

Text Representation 2: TF-IDF Vectorizer

The tfidf_vectorizer function computes the Term Frequency-Inverse Document Frequency (TF-IDF) scores for a given list of documents, providing a numerical representation that highlights the importance of terms within the context of the entire document set. This function is useful for transforming text data into a feature matrix, capturing the significance of terms while considering their frequency and uniqueness across the document collection.

The usage of this function is demonstrated below.

tfidf_matrix, feature_names = tfidf_vectorizer(origin_of_species)

print("Matrix of vectorized documents (TF-IDF):")
print(tfidf_matrix)
print("\nFeature names:")
print(feature_names)

Matrix of vectorized documents (TF-IDF):
[[ 0.          0.          0.          0.          0.          0.02349463
   0.00572163  0.          0.00572163  0.          0.02349463  0.
   0.          0.02349463  0.02349463  0.          0.          0.02349463
   0.          0.          0.          0.02349463  0.          0.
   0.          0.          0.02349463  0.          0.          0.
   0.          0.00572163  0.          0.          0.          0.02349463
   0.          0.          0.          0.          0.          0.02349463
   0.          0.          0.00572163 -0.00934982  0.02349463  0.02349463
   0.          0.          0.          0.          0.          0.
   0.          0.          0.02349463  0.          0.          0.
   0.          0.02349463  0.          0.          0.          0.
   0.         -0.00467491  0.          0.02349463  0.          0.
   0.          0.02349463  0.          0.          0.          0.
   0.          0.          0.01309809  0.          0.          0.
   0.          0.02349463  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.02349463  0.          0.          0.          0.02349463  0.
   0.          0.02860815  0.          0.          0.          0.
   0.         -0.00467491  0.          0.          0.02349463  0.02349463
   0.          0.          0.02349463  0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.01206181  0.00603091  0.          0.
   0.0138061   0.          0.          0.0138061   0.02476461  0.
   0.02476461  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.00603091  0.          0.          0.          0.
   0.          0.02476461  0.02476461  0.          0.          0.
   0.          0.          0.         -0.00492761  0.          0.
   0.          0.0138061   0.02476461  0.02476461  0.          0.
   0.          0.          0.          0.          0.          0.02476461
   0.02476461  0.          0.          0.          0.          0.
   0.         -0.00492761  0.02476461  0.          0.02476461  0.02476461
   0.          0.          0.          0.          0.          0.
   0.02476461  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.0138061   0.02476461
   0.          0.          0.          0.0138061   0.          0.
   0.          0.02476461  0.          0.          0.          0.
   0.          0.01206181  0.          0.          0.00603091  0.0138061
   0.02476461 -0.00492761  0.          0.02476461  0.          0.
   0.          0.02476461  0.          0.          0.        ]
 [ 0.          0.02476461  0.          0.02476461  0.          0.
   0.00603091  0.02476461  0.          0.00603091  0.          0.02476461
   0.          0.          0.          0.0138061   0.          0.
   0.          0.          0.          0.          0.02476461  0.
   0.          0.          0.          0.          0.          0.
   0.          0.00603091  0.          0.          0.          0.
   0.          0.          0.          0.          0.00603091  0.
   0.02476461  0.          0.         -0.00492761  0.          0.
   0.          0.0276122   0.          0.          0.02476461  0.
   0.0138061   0.02476461  0.          0.02476461  0.          0.
   0.          0.          0.          0.          0.          0.02476461
   0.         -0.01971044  0.          0.          0.          0.
   0.02476461  0.          0.02476461  0.02476461  0.          0.
   0.          0.02476461  0.          0.          0.          0.02476461
   0.02476461  0.          0.02476461  0.          0.          0.
   0.          0.          0.          0.          0.02476461  0.02476461
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.0138061
   0.         -0.00492761  0.          0.          0.          0.
   0.0138061   0.          0.          0.          0.        ]
 [ 0.0106422   0.          0.01908939  0.          0.01908939  0.
   0.00464882  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.01908939  0.          0.          0.01908939
   0.          0.          0.          0.01908939  0.01908939  0.
   0.          0.          0.01908939  0.          0.01908939  0.
   0.          0.          0.          0.          0.00464882  0.
   0.          0.          0.01394647 -0.00379837  0.          0.
   0.01908939  0.          0.          0.          0.          0.
   0.0106422   0.          0.          0.          0.01908939  0.
   0.          0.          0.          0.01908939  0.01908939  0.
   0.         -0.00379837  0.          0.          0.          0.
   0.          0.          0.          0.          0.01908939  0.
   0.          0.          0.0106422   0.01908939  0.01908939  0.
   0.          0.          0.          0.01908939  0.0106422   0.
   0.01908939  0.          0.01908939  0.0106422   0.          0.
   0.          0.          0.01908939  0.01908939  0.          0.01908939
   0.          0.0185953   0.          0.01908939  0.00464882  0.
   0.         -0.0113951   0.01908939  0.          0.          0.
   0.0106422   0.          0.          0.01908939  0.01908939]
 [ 0.01702752  0.          0.          0.          0.          0.
   0.          0.          0.00743812  0.00743812  0.          0.
   0.01702752  0.          0.          0.          0.          0.
   0.          0.03054302  0.          0.          0.          0.
   0.03054302  0.03054302  0.          0.          0.          0.03054302
   0.03054302  0.          0.          0.03054302  0.          0.
   0.03054302  0.          0.          0.03054302  0.00743812  0.
   0.          0.03054302  0.02975247 -0.00607739  0.          0.
   0.          0.          0.          0.          0.          0.03054302
   0.          0.          0.          0.          0.          0.
   0.          0.          0.03054302  0.          0.          0.
   0.         -0.00607739  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.03054302
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.03054302  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.03054302  0.          0.00743812  0.
   0.         -0.01215477  0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]]

Feature names:
['a', 'accumulating', 'after', 'all', 'allowed', 'america', 'and', 'any', 'as', 'be', 'beagle', 'bearing', 'been', 'beings', 'board', 'by', 'called', 'certain', 'chapters', 'coming', 'conclusions', 'continent', 'could', 'day', 'decision', 'details', 'distribution', 'drew', 'enlarged', 'entering', 'excused', 'facts', 'five', 'for', 'from', 'geological', 'give', 'greatest', 'has', 'hasty', 'have', 'hms', 'home', 'hope', 'i', 'in', 'inhabitants', 'inhabiting', 'into', 'it', 'latter', 'light', 'made', 'may', 'me', 'might', 'much', 'my', 'myself', 'mysteries', 'mystery', 'naturalist', 'not', 'notes', 'object', 'occurred', 'of', 'on', 'one', 'organic', 'origin', 'our', 'out', 'past', 'patiently', 'perhaps', 'period', 'personal', 'philosophers', 'possibly', 'present', 'probable', 'pursued', 'question', 'reflecting', 'relations', 'return', 'same', 'seemed', 'seen', 'short', 'show', 'sketch', 'some', 'something', 'sorts', 'south', 'species—that', 'speculate', 'steadily', 'struck', 'subject', 'that', 'the', 'them', 'then', 'these', 'this', 'throw', 'to', 'up', 'volume', 'was', 'when', 'which', 'will', 'with', 'work', 'years’']
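
The printed values appear consistent with the weighting tf × ln(N / (1 + df)), where tf is a word's relative frequency in the document, N is the number of documents and df is the number of documents containing the word. For example, a word that occurs once in the 39-word first document and in no other document scores (1/39) × ln(5/2) ≈ 0.0235. Under this formula, words that occur in every document get a negative score, which matches the negative entries above. This formula is inferred from the output rather than taken from the package source, so the sketch below, with its hypothetical name simple_tfidf, should be read as an illustration only.

```python
import math
from collections import Counter


def simple_tfidf(tokenized_docs):
    """Return (matrix, vocabulary) using tf * ln(N / (1 + df)).

    The idf variant is an assumption inferred from the printed output.
    """
    n_docs = len(tokenized_docs)
    vocabulary = sorted({word for doc in tokenized_docs for word in doc})
    # Document frequency: number of documents in which each word occurs.
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    idf = {word: math.log(n_docs / (1 + doc_freq[word])) for word in vocabulary}
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        row = [(counts[word] / total if total else 0.0) * idf[word] for word in vocabulary]
        matrix.append(row)
    return matrix, vocabulary


docs = [["a", "b"], ["a", "c"], ["a"]]
matrix, vocabulary = simple_tfidf(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Other libraries use different smoothing, so scores computed elsewhere for the same documents may not match these.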

Text Representation 3: Tokenizer and Padding

If you would like to feed the data to recurrent neural networks (RNNs), you can transform your text with tokenizer_padding. This function converts each word into a token represented by a numeric identifier while keeping the word order of the original sentence, which is important for RNNs. It also pads shorter sequences with zeros at the end so that all sequences have the same length, because deep learning libraries generally do not accept sequences of uneven lengths.

The usage of this function is demonstrated below.

text_tokenized_padded = tokenizer_padding(origin_of_species)

print("Tokenized and padded sequences:")
print(text_tokenized_padded)

Tokenized and padded sequences:
[[  1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.  14.
   15.  16.  17.  18.  16.  19.  20.  21.  22.  23.  24.  15.  16.  25.
   26.  18.  16.  27.  28.  16.  29.  30.  18.  31.  32.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]
 [ 33.  14.   6.  34.  35.  36.  15.  16.  37.  38.  18.  39.  40.  41.
   28.  42.  43.  44.   2.  16.  45.  18.  46.  47.  18.  48.   6.  49.
   50.  51.  52.  53.  54.  18.  55.  56.  57.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]
 [  2.  58.  59.  60.  49.  61.  28.  62.  15.  31.  63.  64.  65.  35.
   66.  67.   2.  39.  68.  53.  69.  70.  24.  71.   2.  72.  73.  18.
   14.  74.  75.  76.  77.  78.  79.   2.  49.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]
 [ 80.  81.  82.  83.   8.  84.  85.  28.  86.   2.  16.  87.  24.  88.
   89.  43.  90.  91.  33.   8.  92.  15.  93.  94.  95.  18.  16.  96.
   74.  97.  41.  28.  62.  98.  99.  31. 100.  28.  16.  27. 101.   8.
   77. 102. 103.  16. 104. 105.]
 [  8. 106.  31.   8. 107.  35. 108. 109. 110.   2.  33. 111. 112.   6.
    8. 113. 114.  28. 115.  31.   8.  77. 116.  51. 117.  15. 118.  28.
   94. 119.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.]]
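
In the output above, identifiers seem to be assigned in order of first appearance, starting at 1, with 0 reserved for padding. The following is a minimal sketch of that scheme. The name simple_tokenizer_padding is hypothetical, it takes already-tokenized documents, and it returns plain Python integers rather than the floating-point array printed above.

```python
def simple_tokenizer_padding(tokenized_docs):
    """Map words to integer ids by first appearance and right-pad with zeros."""
    token_ids = {}
    sequences = []
    for doc in tokenized_docs:
        sequence = []
        for word in doc:
            if word not in token_ids:
                # Ids start at 1 because 0 is reserved for padding.
                token_ids[word] = len(token_ids) + 1
            sequence.append(token_ids[word])
        sequences.append(sequence)
    # Pad every sequence at the end up to the longest sequence length.
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, token_ids


docs = [["when", "on", "board"], ["on", "the", "board", "again"]]
padded, token_ids = simple_tokenizer_padding(docs)
print(padded)
print(token_ids)
```

Returning the word-to-id mapping alongside the padded sequences makes it possible to encode new text with the same identifiers later.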