Feature extraction is a technique that allows us to obtain vector representations of text. These representations can be used for various downstream tasks such as similarity comparison, clustering, or as input to other machine learning models.
The Hugging Face Transformer Library provides an easy-to-use pipeline for feature extraction. Let's explore this with an example:
Example: Feature Extraction
from transformers import pipeline
import torch
# Initialize the feature-extraction pipeline
feature_extractor = pipeline("feature-extraction", model="facebook/bart-large-mnli", revision="c626438")
# Example texts
texts = [
"The quick brown fox jumps over the lazy dog.",
"I love machine learning and natural language processing.",
"Feature extraction is a crucial step in many NLP tasks."
]
# Extract features
features = feature_extractor(texts)
# Convert to tensors for easier manipulation
tensors = [torch.tensor(feature) for feature in features]
print(f"Number of texts processed: {len(tensors)}")
print(f"Shape of each feature vector: {tensors[0].shape}")
# Print the first few values of each feature vector
for i, tensor in enumerate(tensors):
print(f"\nFirst few values of feature vector for text {i+1}:")
print(tensor[0, :5]) # Print first 5 values of the first token
# Calculate cosine similarity between vectors
def cosine_similarity(v1, v2):
return torch.nn.functional.cosine_similarity(v1, v2, dim=1)
print("\nCosine similarities:")
print("Between text 1 and 2:", cosine_similarity(tensors[0], tensors[1]).item())
print("Between text 1 and 3:", cosine_similarity(tensors[0], tensors[2]).item())
print("Between text 2 and 3:", cosine_similarity(tensors[1], tensors[2]).item())
The above example demonstrates how to use the feature extraction pipeline from the Hugging Face Transformer Library.
We extract feature vectors for three different texts, print some basic information about these vectors, and then calculate the cosine similarity between them. The cosine similarity gives us a measure of how similar the texts are in the vector space, with values closer to 1 indicating higher similarity.
Feature extraction is used in various NLP applications, such as document classification, clustering, semantic search, and more. By converting text into dense vector representations, we can leverage the power of pre-trained language models to capture complex semantic relationships between words and phrases.

Provide Feedback For This Article
We take your feedback seriously and use it to improve our content. Thank you for helping us serve you better!
😊 Thanks for your time, your feedback has been registered!
Comments & Discussion
Facing issues? Have questions? Post them here! We're happy to help!