>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
'this'], ...)
>>> print(X.shape)
(4, 9)
build_analyzer()[source]
Return a callable to process input data.
The callable handles preprocessing, tokenization, and n-grams generation.
Returns:
    analyzer : callable
        A function to handle preprocessing, tokenization, and n-grams generation.
build_preprocessor()[source]
Return a function to preprocess the text before tokenization.
Returns:
    preprocessor : callable
        A function to preprocess the text before tokenization.
build_tokenizer()[source]
Return a function that splits a string into a sequence of tokens.
Returns:
    tokenizer : callable
        A function to split a string into a sequence of tokens.
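For illustration, these builder methods can be called directly on an unfitted vectorizer; a quick sketch of the default behavior (lowercasing on, the default token pattern keeping only tokens of two or more word characters):

>>> vectorizer = TfidfVectorizer()
>>> vectorizer.build_preprocessor()('Hello World!')
'hello world!'
>>> vectorizer.build_tokenizer()('hello world!')
['hello', 'world']
>>> vectorizer.build_analyzer()('Bag-of-words, n-grams!')
['bag', 'of', 'words', 'grams']

Note that the single-character token 'n' is dropped by the default token pattern, which is why it is missing from the analyzer output.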
decode(doc)[source]
Decode the input into a string of unicode symbols.
The decoding strategy depends on the vectorizer parameters.
Parameters:
    doc : bytes or str
        The string to decode.
Returns:
    doc : str
        A string of unicode symbols.
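As a quick sketch, with the default encoding='utf-8' the decoder turns raw bytes into text, while str input passes through unchanged:

>>> TfidfVectorizer().decode(b'caf\xc3\xa9')
'café'
>>> TfidfVectorizer().decode('café')
'café'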
fit(raw_documents, y=None)[source]
Learn vocabulary and idf from training set.
Parameters:
    raw_documents : iterable
        An iterable which generates either str, unicode or file objects.
    y : None
        This parameter is not needed to compute tfidf.
Returns:
    self : object
        Fitted vectorizer.
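As a quick illustration (reusing corpus from the example above), fitting learns the vocabulary, which is then available as the vocabulary_ mapping:

>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit(corpus)
TfidfVectorizer()
>>> len(vectorizer.vocabulary_)
9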
fit_transform(raw_documents, y=None)[source]
Learn vocabulary and idf, return document-term matrix.
This is equivalent to fit followed by transform, but more efficiently
implemented.
Parameters:
    raw_documents : iterable
        An iterable which generates either str, unicode or file objects.
    y : None
        This parameter is ignored.
Returns:
    X : sparse matrix of shape (n_samples, n_features)
        Tf-idf-weighted document-term matrix.
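The result is a scipy sparse matrix; a sketch of inspecting it densely, where rows are documents and columns are the features listed by get_feature_names_out (the weights shown are what the defaults, l2 normalization and smoothed idf, give for this corpus):

>>> X = TfidfVectorizer().fit_transform(corpus)
>>> X.toarray().round(2)[0]  # tf-idf weights of the first document
array([0.  , 0.47, 0.58, 0.38, 0.  , 0.  , 0.38, 0.  , 0.38])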
get_feature_names_out(input_features=None)[source]
Get output feature names for transformation.
Parameters:
    input_features : array-like of str or None, default=None
        Not used, present here for API consistency by convention.
Returns:
    feature_names_out : ndarray of str objects
        Transformed feature names.
get_metadata_routing()[source]
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
Returns:
    routing : MetadataRequest
        A MetadataRequest encapsulating routing information.
get_params(deep=True)[source]
Get parameters for this estimator.
Parameters:
    deep : bool, default=True
        If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:
    params : dict
        Parameter names mapped to their values.
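For example, the constructor defaults show up directly in the returned parameter dictionary (a quick sketch):

>>> params = TfidfVectorizer().get_params()
>>> params['lowercase'], params['ngram_range']
(True, (1, 1))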
set_fit_request(*, raw_documents: bool | None | str = '$UNCHANGED$') → TfidfVectorizer[source]
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters:
    raw_documents : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
        Metadata routing for the raw_documents parameter in fit.
Returns:
    self : object
        The updated object.
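A minimal sketch, assuming metadata routing has been enabled globally first (the set_*_request methods are only available in that mode); the method returns the estimator itself, so it chains:

>>> from sklearn import set_config
>>> set_config(enable_metadata_routing=True)  # required before set_fit_request
>>> vectorizer = TfidfVectorizer().set_fit_request(raw_documents=True)
>>> set_config(enable_metadata_routing=False)  # restore the default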
set_params(**params)[source]
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Parameters:
    **params : dict
        Estimator parameters.
Returns:
    self : estimator instance
        Estimator instance.
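A quick sketch of both forms; the step names here ('tfidf', 'clf') are illustrative and would be whatever you chose when building the pipeline:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LogisticRegression())])
>>> pipe = pipe.set_params(tfidf__ngram_range=(1, 2))  # nested parameter access
>>> pipe.get_params()['tfidf__ngram_range']
(1, 2)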
set_transform_request(*, raw_documents: bool | None | str = '$UNCHANGED$') → TfidfVectorizer[source]
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters:
    raw_documents : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
        Metadata routing for the raw_documents parameter in transform.
Returns:
    self : object
        The updated object.
transform(raw_documents)[source]
Transform documents to document-term matrix.
Uses the vocabulary and document frequencies (df) learned by fit (or
fit_transform).
Parameters:
    raw_documents : iterable
        An iterable which generates either str, unicode or file objects.
Returns:
    X : sparse matrix of shape (n_samples, n_features)
        Tf-idf-weighted document-term matrix.
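A sketch: once fitted, the same vocabulary is applied to unseen documents, and terms outside the learned vocabulary are simply ignored:

>>> vectorizer = TfidfVectorizer().fit(corpus)
>>> X_new = vectorizer.transform(['A brand new document.'])
>>> X_new.shape
(1, 9)
>>> X_new.nnz  # only 'document' appears in the learned vocabulary
1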
Examples using TfidfVectorizer:
- Biclustering documents with the Spectral Co-clustering algorithm
- Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
- Sample pipeline for text feature extraction and evaluation
- Column Transformer with Heterogeneous Data Sources
- Classification of text documents using sparse features
- Clustering text documents using k-means
- FeatureHasher and DictVectorizer Comparison