Data Science

rachel_k · 01-29-2024

Businesses are becoming more data-driven, but a significant portion of data is unstructured text - not numeric, tabular data. The average adult can read about 238 words per minute, which makes manual text review slow and expensive (1). The Alteryx Intelligence Suite Text Mining tools are designed to automate text processing tasks to help businesses increase efficiency and spend more time on tasks that matter.

This post focuses on the Text Classification and Zero-shot Text Classification tools and includes a demo that sorts e-commerce listings from this Kaggle dataset into four categories: Household, Electronics, Books, and Clothing & Accessories. The demo classified 8,113 records totaling 934,619 words. At the reading rate above, an average employee would need more than 65 hours just to read all 8,113 records. The provided workflow took about 30 minutes to create and 2.5 hours to run both Text Classification and Zero-shot Text Classification, unlocking over 60 hours of productivity for this single task. Let’s get started!

Note that Alteryx Designer Desktop and Alteryx Intelligence Suite are required to use these tools.

Why Use Text Classification?

Text Classification and Zero-shot Text Classification do exactly what they sound like: they classify a field of text into different categories. This is useful for automating use cases such as the following:

Use written product descriptions to sort products into categories
Classify tickets coming into a help desk/IT desk/support desk to route them to the correct department or to sort them by severity
Add tags to product descriptions, comments, news articles, etc.
Monitor and track engagement in social media for posts on specific topics or content
Moderate social media posts by identifying and removing posts containing specific topics
Classify and route inbound documents or email to the correct person or department based on the content of the message
Sort free text survey responses, such as employee satisfaction or customer/patient satisfaction, based on the topic of the response
Categorize job descriptions into types of roles, experience levels, or line of business

Zero-shot Text Classification

Zero-shot Text Classification was released with Alteryx Designer version 22.3. It classifies text into user-defined groups using a pre-trained Hugging Face model that comes installed with Alteryx Intelligence Suite and runs on ONNX Runtime for fast execution. Because the model is pre-trained, there’s no need for pre-labeled training data or any up-front training.

Zero-shot Text Classification is great for classifying text into well-known categories that aren’t highly specialized or expertise-dependent. For example, the demo workflow classifies products into four simple, understandable categories: Household, Electronics, Books, and Clothing & Accessories. Specialized knowledge shouldn’t be required to classify data for most of the provided product descriptions.

No significant data preparation is required for Zero-shot Text Classification. In the provided use case, the Text Pre-processing tool was used to remove extraneous punctuation, but this step is optional; if punctuation helps convey the context of the text, skip it. Data to be scored connects to the D input anchor, and a list of potential categories connects to the L input anchor.

Tool configuration is also simple. In the “Column with Text” field, select the text column to classify (from the D input anchor). In the “Column with Labels” field, select the column with potential categories (from the L input anchor). If the “Multi-label Classification” box is unchecked, each record will be assigned to the most likely category.

A numeric column will be added to the output anchor for each category in the L input anchor. The value in each category column represents the certainty that the text in that record belongs in that column’s category. Note that the likelihood values across categories will not add up to 1, and instead represent a likelihood within the individual category only. If the Multi-label Classification setting is unchecked, there will be one more column titled predicted_label in the output, which includes the name of the most likely category (highest likelihood value) for each record.
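As a rough illustration of how predicted_label relates to the per-category columns, the sketch below picks the highest-scoring category for one record. The score values and the helper name pick_predicted_label are hypothetical, and this is not the tool's internal code.

```python
from typing import Dict


def pick_predicted_label(scores: Dict[str, float]) -> str:
    """Return the category with the highest score for one record."""
    return max(scores, key=scores.get)


# Hypothetical scores for a single product description (illustrative only).
row = {
    "Household": 0.41,
    "Electronics": 0.87,
    "Books": 0.12,
    "Clothing & Accessories": 0.09,
}

# Each score is judged within its own category, so the total need not be 1.
print("Sum of scores:", sum(row.values()))
print("predicted_label:", pick_predicted_label(row))
```

Because each score stands on its own, a record can score high (or low) in several categories at once; the predicted label is simply the largest of them.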

Text Classification

Some datasets that are highly business-specific or use-case-specific may need a specialized model to accurately categorize text. The Text Classification tool, released in Designer version 23.1, allows users to leverage their own labeled data to train a customized model for text classification. For demonstration purposes, this example uses the same dataset as the Zero-shot Text Classification overview, but in practice, the dataset could be specific to any business, business line, or industry.

In the case of the Text Classification tool, some preparation with the Text Pre-processing tool is recommended – specifically the option ‘Convert to Word Root (Lemmatize)’. This option converts variants of a word (such as changing, changed, changes, changer) to the root word (change), and may help the classification model contextualize the text content. Optionally, digits or punctuation might also be removed with the Text Pre-processing tool.
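To make the idea of converting variants to a root word concrete, here is a toy sketch that uses a small lookup table. The table and the function name lemmatize are hypothetical stand-ins; a real lemmatizer (including whatever the Text Pre-processing tool uses) relies on a full vocabulary and grammatical information rather than four hard-coded entries.

```python
# Toy lookup table mapping word variants to their root (illustrative only).
LEMMAS = {
    "changing": "change",
    "changed": "change",
    "changes": "change",
    "changer": "change",
}


def lemmatize(text: str) -> str:
    """Lowercase the text and replace each known variant with its root word."""
    return " ".join(LEMMAS.get(word, word) for word in text.lower().split())


print(lemmatize("Changed plans and changing prices"))
# prints "change plans and change prices"
```

After this step, all four variants count as the same token, which gives a classifier more evidence per word.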

The Create Samples tool was used to split the data into training, validation, and holdout datasets. The E and V output anchors of that tool connect to the Training (T) and Validation (V) input anchors of the Text Classification tool. The Text Classification tool outputs a model object (M anchor) and information about the model (E anchor).
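For readers who think in code, the three-way split can be sketched as follows. This is a generic shuffle-and-slice split, not the Create Samples tool's implementation, and the 60/20/20 percentages and the seed are assumptions; the post does not state the values used in the workflow.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(
    records: Sequence,
    estimation_pct: float = 0.6,   # assumed share for training
    validation_pct: float = 0.2,   # assumed share for validation
    seed: int = 1,                 # fixed seed so the split is repeatable
) -> Tuple[List, List, List]:
    """Shuffle records and split them into estimation, validation, and holdout."""
    if estimation_pct < 0 or validation_pct < 0 or estimation_pct + validation_pct > 1:
        raise ValueError("percentages must be non-negative and sum to at most 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_est = int(len(shuffled) * estimation_pct)
    n_val = int(len(shuffled) * validation_pct)
    estimation = shuffled[:n_est]
    validation = shuffled[n_est:n_est + n_val]
    holdout = shuffled[n_est + n_val:]   # whatever remains
    return estimation, validation, holdout


estimation, validation, holdout = split_samples(range(100))
print(len(estimation), len(validation), len(holdout))
```

The holdout set is never shown to the model during training, which is what makes it a fair test of accuracy later.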

To configure the tool, select the columns with text and labels for both the training and validation text (T and V input anchors, respectively). In the Advanced area, select the algorithm to use for the model. Currently, two algorithms are available, Multinomial Naïve Bayes and Linear SVC, each with tunable hyperparameters described in the tool documentation. A third option, Auto Mode, evaluates both algorithms and identifies an ideal set of hyperparameters. Auto Mode was used in the provided workflow to demonstrate how simple the tool is to configure and how well the model performs without users needing deep expertise in text classification algorithms and parameters.
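To give a feel for what a Multinomial Naïve Bayes text classifier does, here is a minimal textbook sketch with add-one (Laplace) smoothing. It is not the Alteryx implementation; the function names, the whitespace tokenization, and the tiny training set are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def train_nb(samples: List[Tuple[str, str]], alpha: float = 1.0):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab, alpha


def predict_nb(model, text: str) -> str:
    """Return the label with the highest log posterior for the text."""
    label_counts, word_counts, vocab, alpha = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for word in text.lower().split():
            if word in vocab:                            # skip unseen words
                score += math.log((word_counts[label][word] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled examples (not taken from the Kaggle dataset).
training = [
    ("wooden dining table with four chairs", "Household"),
    ("stainless steel kitchen knife set", "Household"),
    ("wireless bluetooth headphones with charger", "Electronics"),
    ("usb charger cable for phone", "Electronics"),
    ("paperback mystery novel by bestselling author", "Books"),
    ("hardcover history book with illustrations", "Books"),
    ("cotton t-shirt for men", "Clothing & Accessories"),
    ("leather belt and wallet set", "Clothing & Accessories"),
]

model = train_nb(training)
print(predict_nb(model, "bluetooth phone charger"))
```

The smoothing parameter alpha is one example of a tunable hyperparameter: it controls how much probability is reserved for words a category has not seen.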

Once the Text Classification tool is configured, use the Machine Learning Predict tool to evaluate the model with new data. The M output from the Text Classification tool connects to the Model (M) input anchor. The data to be scored, here from the holdout (H) anchor of the Create Samples tool, connects to the Data (D) input. The Machine Learning Predict tool does not require any configuration. The complete process looks like this:

Results: How accurate are the models?

A total of 8,113 records, containing 934,619 words, were classified using Zero-shot Text Classification and Text Classification. Again, this represents about 65 hours of human reading time; the Alteryx run time for both tools combined was around 2.5 hours.
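The reading-time estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in this post (934,619 words, 238 words per minute, about 30 minutes to build, 2.5 hours to run); the variable and function names are illustrative.

```python
WORDS_TOTAL = 934_619      # words across all classified records
WORDS_PER_MINUTE = 238     # average adult reading rate cited in the post
BUILD_HOURS = 0.5          # about 30 minutes to create the workflow
RUN_HOURS = 2.5            # run time for both classification tools


def reading_hours(words: int, words_per_minute: int) -> float:
    """Hours needed to read the given number of words at the given rate."""
    return words / words_per_minute / 60


human_hours = reading_hours(WORDS_TOTAL, WORDS_PER_MINUTE)
hours_saved = human_hours - (BUILD_HOURS + RUN_HOURS)

print(f"Human reading time: {human_hours:.1f} hours")
print(f"Time saved:         {hours_saved:.1f} hours")
```

Note that this counts reading time only; a person would also need time to decide on and record a category for each listing.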

Contingency Table tools quantified the accuracy of both text classification methods. Zero-shot Text Classification accurately classified products as Books, Clothing & Accessories, Electronics, or Household 75.3% of the time ((1100 + 1351 + 1125 + 2530) / 8113).

The Text Classification tool, which was trained on the same dataset (but not on the specific records that were scored), performed even better, with 95.9% accuracy ((1844 + 1449 + 1265 + 3224) / 8113).
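Both accuracy figures come from the same calculation: add up the correctly classified records in each category (the diagonal of the contingency table) and divide by the total number of records. A minimal sketch using the counts quoted above, with an illustrative function name:

```python
from typing import Sequence


def accuracy(correct_per_category: Sequence[int], total_records: int) -> float:
    """Share of records whose predicted category matches the actual category."""
    return sum(correct_per_category) / total_records


TOTAL_RECORDS = 8113

zero_shot_acc = accuracy([1100, 1351, 1125, 2530], TOTAL_RECORDS)
text_class_acc = accuracy([1844, 1449, 1265, 3224], TOTAL_RECORDS)

print(f"Zero-shot Text Classification: {zero_shot_acc:.1%}")
print(f"Text Classification:           {text_class_acc:.1%}")
```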

Zero-shot Text Classification performed decently, considering that the underlying model did not have expertise in the exact data being used. The Text Classification model significantly improved on that accuracy. The Auto Mode setting was able to identify and tune an optimal model, with no expertise required from the end user.

For either method, the workflow was simple to configure. Depending on the use case, the accuracy ranges from ‘pretty good’ to ‘comparable with a human’, and the automation frees up time for a person to work on other tasks or to focus on hard-to-classify records.

Refining Models

Sometimes, getting to a model that performs well may require iteration in the number of categories, or the name of the categories used. For example, in the Zero-shot Text Classification tool, a large number of records labeled as ‘Books’ were incorrectly classified as ‘Household.’ To reduce errors, a user might review the data, and consider several options, such as:

Splitting the “books” category into two categories: “Literature” and “Cookbooks”
Splitting the “household” category into “Furniture” and “Home Design Books”
Combining “books” and “household” into “Books & Household”

Similar iteration may be required when developing categories/labels for the Text Classification tool; however, the categories provided worked quite well in this use case.

When to Use Text Classification and When to Use a Large Language Model

When is it right to use Text Classification, Zero-shot Text Classification, or a large language model like ChatGPT? The table below summarizes each method across several business requirements. If cost, privacy, accuracy, or customization is a priority, Text Classification may be ideal. If flexibility is a requirement, there is no training data, or categories are not highly specialized, then Zero-shot Text Classification or LLMs may be superior.

	Text Classification	Zero-shot Text Classification	Off-the-shelf Large Language Model
Cost	No cost associated with training or scoring	No cost associated with training or scoring	Cost to train, cost per text classified
Customization and Accuracy	Can be trained to specific data	Uses a pre-trained model that isn’t customizable	Not easily customizable
Training	Requires labeled training data	No training or training data required	No training or training data required
Flexibility/Retraining	Adding a new label requires retraining	Easy to add a new label	Easy to add a new label
Bias	Bias would exist based on user-provided data	User can’t control training data	User can’t control training data
Privacy	Model can be trained to private data and scoring happens within private environment	Model trained on public data, but scored in private environment	Model trained on public data and scored data leaves the private environment

About Alteryx Professional Services

The Alteryx Professional Services team is a group of trusted advisors who can assist you as you define, develop, and execute your vision for the Alteryx platform. We accelerate your analytics capabilities through deployment and migration, creating process control and governance frameworks, building automation efficiencies, or developing custom analytic solutions. Get in touch if you want help on your analytic journey!

References:

1. Brysbaert, Marc. "How many words do we read per minute? A review and meta-analysis of reading rate." Journal of Memory and Language 109 (2019): 104047.