Construct a Language Model on Your WhatsApp Chats
1. Chosen Approach
2. Data Source
3. Tokenization
4. Indexing
5. Model Architecture
6. Model Training
7. Chat-Mode

To train a language model, we need to break language into pieces (so-called tokens) and feed them to the model incrementally. Tokenization can be performed at multiple levels; a small comparison follows the list below.

  • Character-level: Text is perceived as a sequence of individual characters (including white spaces). This granular approach allows every possible word to be formed from a sequence of characters. However, it is harder to capture semantic relationships between words.
  • Word-level: Text is represented as a sequence of words. However, the model’s vocabulary is limited to the words present in the training data.
  • Sub-word-level: Text is broken down into sub-word units, which are smaller than words but larger than characters.
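To make the difference tangible, here is a toy sketch (on a made-up sentence and with made-up variable names) of how character-level and naive word-level splitting compare:

# Toy comparison of tokenization granularities on a made-up sentence.
sentence = "Hi how are you guys?"

char_tokens = list(sentence)    # character-level: every symbol, including spaces
word_tokens = sentence.split()  # naive word-level: split on whitespace

print(char_tokens)  # ['H', 'i', ' ', 'h', 'o', 'w', ' ', 'a', 'r', 'e', ...]
print(word_tokens)  # ['Hi', 'how', 'are', 'you', 'guys?']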

While I started off with a character-level tokenizer, I felt that training time was wasted on learning the character sequences of recurring words, rather than on the semantic relationships between words across the sentence.

For the sake of conceptual simplicity, I decided to switch to a word-level tokenizer, setting aside the available libraries for more sophisticated tokenization strategies.

from typing import List

from nltk.tokenize import RegexpTokenizer


def custom_tokenizer(txt: str, spec_tokens: List[str], pattern: str = r"|\d|\w+|[^\s]") -> List[str]:
    """
    Tokenize text into words or characters using NLTK's RegexpTokenizer, considering
    given special combinations as single tokens.

    :param txt: The corpus as a single string element.
    :param spec_tokens: A list of special tokens (e.g. ending, out-of-vocab).
    :param pattern: By default the corpus is tokenized on a word level (split by spaces).
                    Numbers are considered single tokens.

    >> note: The pattern for character-level tokenization is '|.'
    """
    pattern = "|".join(spec_tokens) + pattern
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(txt)
    return tokens
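As a usage sketch, the exported chat can then be tokenized as follows, yielding a token sequence like the example shown below. The file name and the special token strings here are placeholders; the actual values come from the data-export and preprocessing steps described earlier.

# Placeholder file name and special tokens -- the real values are defined
# in the earlier steps of the project.
with open("whatsapp_chat.txt", "r", encoding="utf-8") as f:
    corpus = f.read()

spec_tokens = ["<END>", "<UNK>"]  # assumed names for the message-end and out-of-vocab tokens
tokens = custom_tokenizer(corpus, spec_tokens)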

["Alice:", "Hi", "how", "are", "you", "guys", "?", "", "Tom:", ... ]

It turned out that my training data has a vocabulary of ~70,000 unique words. However, since many words appear only a couple of times, I decided to replace such rare words with a special out-of-vocabulary token. This reduced the vocabulary to ~25,000 words, which results in a smaller model that needs to be trained later.

from collections import Counter
from typing import List, Set, Union


def get_infrequent_tokens(tokens: Union[List[str], str], min_count: int) -> Set[str]:
    """
    Identify tokens that appear no more than a minimum count.

    :param tokens: When it is the raw text as a string, frequencies are counted on character level.
                   When it is the tokenized corpus as a list, frequencies are counted on token level.
    :param min_count: Threshold of occurrence to flag a token.
    :return: Set of tokens that appear infrequently.
    """
    counts = Counter(tokens)
    infreq_tokens = {k for k, v in counts.items() if v <= min_count}
    return infreq_tokens

def mask_tokens(tokens: List[str], mask: Set[str]) -> List[str]:
    """
    Iterate through all tokens. Any token that is an element of the mask set is replaced
    by the unknown token.

    :param tokens: The tokenized corpus.
    :param mask: Set of tokens that shall be masked in the corpus.
    :return: List of tokens after the masking operation.
    """
    # `unknown_token` is the out-of-vocabulary special token defined earlier in the project.
    return [unknown_token if t in mask else t for t in tokens]


infreq_tokens = get_infrequent_tokens(tokens, min_count=2)
tokens = mask_tokens(tokens, infreq_tokens)

["Alice:", "Hi", "how", "are", "you", "", "?", "", "Tom:", ... ]
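As a quick sanity check (a sketch reusing the `tokens` variable from above), the remaining vocabulary size can be inspected; for my chat export it lands near the ~25,000 figure mentioned earlier.

# Count distinct tokens after masking; on the chat data described above this
# drops from ~70,000 unique words to ~25,000.
vocab = sorted(set(tokens))
print(f"vocabulary size: {len(vocab):,}")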
