Natural Language Processing: How A Computer Learns Language
Daisy Morales
You may not remember much about learning your first language because you were probably very young. However, you know that it involved things like learning the alphabet and vowels, reading books, and having verbal interactions with adults. If you have recently learned a new language, the process is much fresher in your mind. This is similar to how computer programs learn to understand and interact with human language. The process is called natural language processing, and it involves a few steps before a computer can "speak."
"If you have recently learned a new language, the process is much fresher in your mind. This is similar to how computer programs learn to understand and interact with human language."
Let's take a look at the steps that must be taken before a computer can understand and interact:
For a computer to understand human language, it must first be exposed to a large amount of data from different sources such as books, articles, and social media. With the proliferation of data online, the Internet has become a vast repository of data for training computer models. Companies have started to tap into this repository, with Google recently updating its privacy policy to clearly state that it can use anything online to build its AI models.
Like Google’s AI models, we also use the Internet to continue learning about our language. Even adults continually learn new words, especially more colloquial terms. (I just recently learned about "rizz" and "dupe.")
Tokenization is a way of breaking text into words or parts of words, called tokens, which are then translated into numbers/vectors, called embeddings, that are meaningful representations of the word.
In English, a sentence like “I run track and field after school” would be tokenized something like this: “I”, “run”, “track”, “and”, “field”, “after”, “school”, “.” This way, a computer can take each word and punctuation mark and process it individually, making the sentence easier to understand. Word embeddings can also be compared to each other to generate understanding. For example, the vector for “house” would be close to the vector for “home” and far from the vector for “office.”
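To make this concrete, here is a rough sketch in Python of both ideas: a deliberately naive tokenizer (production systems typically use subword methods such as byte-pair encoding) and a handful of made-up three-dimensional embedding vectors compared with cosine similarity. The vectors and their values are invented purely for illustration; real models learn vectors with hundreds of dimensions from huge amounts of text.

```python
import re
import math

def tokenize(sentence):
    # Naive tokenizer: keep runs of word characters, and pull punctuation
    # into its own token. Real systems often use subword tokenizers instead.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("I run track and field after school."))
# ['I', 'run', 'track', 'and', 'field', 'after', 'school', '.']

# Toy embeddings: hand-made 3-dimensional vectors, for illustration only.
embeddings = {
    "house":  [0.9, 0.1, 0.3],
    "home":   [0.8, 0.2, 0.3],
    "office": [0.1, 0.9, 0.7],
}

def cosine_similarity(a, b):
    # Higher values mean the two vectors point in a more similar direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["house"], embeddings["home"]))    # close to 1.0
print(cosine_similarity(embeddings["house"], embeddings["office"]))  # noticeably lower
```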
A human learning English would break down the sentence “I run track and field after school” in much the same way. The first word would give them information about the subject; the second word would give them information about the action being performed; the third, fourth, and fifth words would give them information about the name of the action; and the sixth and seventh words would give information about time and place.
In addition to being tokenized, text data is also cleaned by removing unnecessary characters, punctuation, and information. Often, this includes lowercasing text, removing stop words such as “and” and “the” that don't carry as much meaning as other words, and reducing words to their base form. So, with the sentence example above, the processed text would look something like “run”, “track”, “field”, and “school”.
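A minimal sketch of that cleaning step might look like the following, assuming a tiny hand-picked stop-word list and a very crude suffix-chopping function standing in for a real stemmer or lemmatizer (libraries such as NLTK or spaCy handle this far more carefully).

```python
import re

# A tiny, hand-picked stop-word list for illustration; real pipelines use
# much longer lists from libraries such as NLTK or spaCy.
STOP_WORDS = {"i", "a", "an", "the", "and", "to", "am", "after"}

def stem(word):
    # Extremely crude stand-in for stemming: chop a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean(sentence):
    # Lowercase the text and keep only alphabetic words, dropping punctuation.
    words = re.findall(r"[a-z]+", sentence.lower())
    # Remove stop words and reduce what's left to a rough base form.
    return [stem(w) for w in words if w not in STOP_WORDS]

print(clean("I run track and field after school."))
# ['run', 'track', 'field', 'school']
```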
The vast majority of data used to train AI models isn't annotated, since annotation is a very resource-intensive and time-consuming task, so most models learn in an unsupervised way. However, there are some instances where data is annotated by humans after the initial training phase.
In this case, human annotators go through the text and add labels or annotations to indicate the meaning, sentiment, or intent associated with words and phrases. This helps computers understand the meaning of a sentence.
Our sentence example above is pretty matter-of-fact. A human annotator would probably label it as such because the words used do not carry any overt sentiment. If we modified the sentence to state “I am excited to run track and field after school,” an annotator would label “excited” as a positive sentiment, teaching computers to extract this meaning from “excited” and its synonyms.
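To illustrate, here is what a tiny batch of human-annotated sentiment data could look like; the records, labels, and format are made up for this example rather than drawn from any real annotation tool.

```python
from collections import Counter

# A small, made-up batch of annotated examples. Each record pairs a snippet
# of text with the sentiment label a human annotator assigned to it.
annotated_examples = [
    {"text": "I run track and field after school.", "sentiment": "neutral"},
    {"text": "I am excited to run track and field after school.", "sentiment": "positive"},
    {"text": "I dread running track and field after school.", "sentiment": "negative"},
]

# Counting the labels is a quick sanity check on how balanced the dataset is
# before it is handed to a model.
print(Counter(example["sentiment"] for example in annotated_examples))
# Counter({'neutral': 1, 'positive': 1, 'negative': 1})
```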
One of the most popular examples of generative AI, OpenAI’s ChatGPT, used human annotators to look through thousands of snippets of text and label examples of toxic language, so that ChatGPT could be trained on those labels and prevented from using that language in its interactions with users. (However, what sounds like a great initiative is also laced with controversy, since OpenAI outsourced this work to Kenyan workers and paid them less than $2 an hour for a job that exposed them to graphic and violent text.)
After the text data has been collected, cleaned, and labeled, it can then be fed to the computer model. The computer will learn about language patterns, relationships between words, and the meaning of those words.
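As a rough illustration of that training step, the sketch below fits a tiny sentiment classifier with scikit-learn on a handful of made-up labeled sentences, then asks it to label a sentence it has never seen. The dataset, labels, and model choice are illustrative only; real systems train on vastly more data with far more sophisticated models.

```python
# Requires scikit-learn (pip install scikit-learn). A toy dataset this small
# will not generalize; it only illustrates the shape of the training step.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "I am excited to run track and field after school",
    "Practice was wonderful today",
    "I dread running track after school",
    "Practice was awful today",
]
labels = ["positive", "positive", "negative", "negative"]

# Turn each sentence into word counts, then learn which words predict which label.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(features, labels)

# The trained model can now label text it has never seen.
new_text = ["I am excited for practice today"]
print(model.predict(vectorizer.transform(new_text)))  # likely ['positive']
```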
The trained model can finally be deployed to perform tasks like language translation or chatbot interactions. The interactions that users have with the model are used to ensure that the model is continually learning new things about the language.
As with humans, natural language processing is a lifelong process for computer models. There are many complex processes that must happen before a computer model can interact with humans in the way we have come to know through the Alexas, Siris, Bixbys, and Google Assistants of our world.