import numpy as np
import scipy.sparse
import sympy
import sklearn.datasets
import sklearn.feature_extraction.text
import umap
import umap.plot
import matplotlib.pyplot as plt
%matplotlib inline
A mathematical example
Our first example constructs a sparse matrix of data out of pure math.
This example is inspired by the work of John
Williamson, and
if you haven’t looked at that work you are strongly encouraged to do so.
The dataset under consideration will be the integers. We will represent
each integer by a vector of its divisibility by distinct primes. Thus
our feature space is the space of prime numbers (less than or equal to
the largest integer we will be considering) – potentially very high
dimensional. In practice a given integer is divisible by only a small
number of distinct primes, so each sample will be mostly made up of
zeros (all the primes that the number is not divisible by), and thus we
will have a very sparse dataset.
To get started we’ll need a list of all the primes. Fortunately we have
sympy
at our disposal and we can quickly get that information with a
single call to primerange. We’ll also need a dictionary mapping the
different primes to the column number they correspond to in our data
structure; effectively we’ll just be enumerating the primes.
primes = list(sympy.primerange(2, 110000))
prime_to_column = {p:i for i, p in enumerate(primes)}
Now we need to construct our data in a format we can put into a sparse
matrix easily. At this point a little background on sparse matrix data
structures is useful. For this purpose we’ll be using the so-called
“LIL”
format.
LIL is short for “List of Lists”, since that is how the data is
internally stored. There is a list of all the rows, and each row is
stored as a list giving the column indices of the non-zero entries. To
store the data values there is a parallel structure containing the value
of the entry corresponding to a given row and column.
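For instance, here is a minimal sketch of that layout on a tiny hand-built matrix (the matrix itself is purely illustrative, relying on the imports at the top of this notebook):
tiny = scipy.sparse.lil_matrix((3, 4), dtype=np.float32)
tiny[0, 1] = 1.0   # row 0 has non-zero entries in columns 1 and 3
tiny[0, 3] = 2.0
tiny[2, 0] = 5.0   # row 2 has a single non-zero entry in column 0
tiny.rows   # object array of per-row lists of column indices: [1, 3], [], [0]
tiny.data   # parallel object array of per-row lists of values: [1.0, 2.0], [], [5.0]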
To put the data together in this sort of format we need to construct
such a list of lists. We can do that by iterating over all the integers
up to a fixed bound, and for each integer (i.e. each row in our dataset)
generating the list of column indices which will be non-zero. The column
indices will simply be the indices corresponding to the primes that
divide the number. Since sympy
has a function primefactors
which
returns a list of the unique prime factors of any integer we simply need
to map those through our dictionary to convert the primes into column
numbers.
Parallel to that we’ll construct the corresponding structure of values
to insert into a matrix. Since we are only concerned with divisibility
this will simply be a one in every non-zero entry, so we can just add a
list of ones of the appropriate length for each row.
%%time
lil_matrix_rows = []
lil_matrix_data = []
for n in range(100000):
    prime_factors = sympy.primefactors(n)
    lil_matrix_rows.append([prime_to_column[p] for p in prime_factors])
    lil_matrix_data.append([1] * len(prime_factors))
CPU times: user 2.07 s, sys: 26.4 ms, total: 2.1 s
Wall time: 2.1 s
Now we need to get that into a sparse matrix. Fortunately the
scipy.sparse
package makes this easy, and we’ve already built the
data in a fairly useful structure. First we create a sparse matrix of
the correct format (LIL) and the right shape (as many rows as we have
generated, and as many columns as there are primes). This is essentially
just an empty matrix however. We can fix that by setting the rows
attribute to be the rows we have generated, and the data
attribute
to be the corresponding structure of values (all ones). The result is a
sparse matrix data structure which can then be easily manipulated and
converted into other sparse matrix formats.
factor_matrix = scipy.sparse.lil_matrix((len(lil_matrix_rows), len(primes)), dtype=np.float32)
# dtype=object is needed since the per-row lists have differing lengths
factor_matrix.rows = np.array(lil_matrix_rows, dtype=object)
factor_matrix.data = np.array(lil_matrix_data, dtype=object)
factor_matrix
<100000x10453 sparse matrix of type '<class 'numpy.float32'>'
with 266398 stored elements in LInked List format>
As you can see we have a matrix with 100000 rows and over 10000 columns.
If we were storing that as a numpy array it would take a great deal of
memory. In practice, however, there are only 260000 or so entries that
are not zero, and that’s all we really need to store, making it much
more compact.
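A rough back-of-the-envelope comparison makes that concrete (this is only an estimate, assuming 4 bytes per float32 value and ignoring the exact index and pointer overheads of the sparse format):
n_rows, n_cols = factor_matrix.shape
n_rows * n_cols * 4       # dense float32 storage would be ~4.2 GB
factor_matrix.nnz * 8     # roughly a value plus a column index per non-zero: ~2 MB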
The question now is how can we feed that sparse matrix structure into
UMAP to have it learn an embedding. The answer is surprisingly
straightforward – we just hand it directly to the fit method. Just like
other sklearn estimators that can handle sparse input, UMAP will detect
the sparse matrix and just do the right thing.
%%time
mapper = umap.UMAP(metric='cosine', random_state=42, low_memory=True).fit(factor_matrix)
CPU times: user 9min 36s, sys: 6.76 s, total: 9min 43s
Wall time: 9min 7s
That was easy! But is it really working? We can easily plot the results:
umap.plot.points(mapper, values=np.arange(100000), theme='viridis')
And this looks very much in line with the results John Williamson got,
with the proviso that we only used 100,000 integers instead of 1,000,000
to ensure that most users should be able to run this example (the full
million may require a large-memory compute node). So it seems like this
is working well. The next question is whether we can use the
transform
functionality to map new data into this space. To test
that we’ll need some more data. Fortunately there are more integers.
We’ll grab the next 10,000 and put them in a sparse matrix, much as we
did for the first 100,000.
%%time
lil_matrix_rows = []
lil_matrix_data = []
for n in range(100000, 110000):
    prime_factors = sympy.primefactors(n)
    lil_matrix_rows.append([prime_to_column[p] for p in prime_factors])
    lil_matrix_data.append([1] * len(prime_factors))
CPU times: user 214 ms, sys: 1.99 ms, total: 216 ms
Wall time: 222 ms
new_data = scipy.sparse.lil_matrix((len(lil_matrix_rows), len(primes)), dtype=np.float32)
new_data.rows = np.array(lil_matrix_rows, dtype=object)
new_data.data = np.array(lil_matrix_data, dtype=object)
new_data
<10000x10453 sparse matrix of type '<class 'numpy.float32'>'
with 27592 stored elements in LInked List format>
To map the new data we generated we can simply hand it to the
transform
method of our trained model. This is a little slow, but it
does work.
new_data_embedding = mapper.transform(new_data)
And we can plot the results. Since we just got the locations of the
points this time (rather than a model) we’ll have to resort to
matplotlib for plotting.
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111)
plt.scatter(new_data_embedding[:, 0], new_data_embedding[:, 1], s=0.1, c=np.arange(10000), cmap='viridis')
ax.set(xticks=[], yticks=[], facecolor='black');
The color scale is different in this case, but you can see that the data
has been mapped into locations corresponding to the various structures
seen in the original embedding. Thus, even with large sparse data we can
embed the data, and even add new data to the embedding.
A text analysis example
Let’s look at a more classical machine learning example of working with
sparse high dimensional data – working with text documents. Machine
learning on text is hard, and there is a great deal of literature on the
subject, but for now we’ll just consider a basic approach. Part of the
difficulty of machine learning with text is turning language into
numbers, since numbers are really all most machine learning algorithms
understand (at heart anyway). One of the most straightforward ways to do
this for documents is what is known as the “bag-of-words”
model. In this
model we view a document as simply a multi-set of the words contained in
it – we completely ignore word order. The result can be viewed as a
matrix of data by setting the feature space to be the set of all words
that appear in any document, and a document is represented by a vector
where the value of the ith entry is the number of times the ith
word occurs in that document. This is a very common approach, and is
what you will get if you apply sklearn’s CountVectorizer
to a text
dataset for example. The catch with this approach is that the feature
space is often very large, since we have a feature for each and every
word that ever occurs in the entire corpus of documents. The data is
sparse however, since most documents only use a small portion of the
total possible vocabulary. Thus the default output format of
CountVectorizer
(and other similar feature extraction tools in
sklearn) is a scipy.sparse
format matrix.
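As a tiny illustration of this (the two-document corpus below is made up purely for demonstration; on older versions of sklearn the vocabulary method is get_feature_names rather than get_feature_names_out), CountVectorizer turns raw text into exactly this kind of sparse count matrix:
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
vectorizer = CountVectorizer()
toy_counts = vectorizer.fit_transform(toy_corpus)
vectorizer.get_feature_names_out()  # the vocabulary: one column per distinct word
toy_counts                          # a 2 x vocabulary-size scipy.sparse matrix of counts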
For this example we’ll make use of the classic 20newsgroups dataset, a
sampling of newsgroup messages from the old NNTP newsgroup system
covering 20 different newsgroups. The sklearn.datasets
module can
easily fetch the data, and, in fact, we can fetch a pre-vectorized
version to save us the trouble of running CountVectorizer
ourselves.
We’ll grab both the training set, and the test set for later use.
news_train = sklearn.datasets.fetch_20newsgroups_vectorized(subset='train')
news_test = sklearn.datasets.fetch_20newsgroups_vectorized(subset='test')
If we look at the actual data we have pulled back, we’ll see that
sklearn has run a CountVectorizer
and produced the data in the sparse
matrix format.
news_train.data
<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
with 1787565 stored elements in Compressed Sparse Row format>
The value of the sparse matrix format is immediately obvious in this
case; while there are only 11,000 samples there are 130,000 features! If
the data were stored in a standard numpy
array we would be using up
more than 10GB of memory! And most of that memory would simply be storing the
number zero, over and over again. In the sparse matrix format it easily fits
in memory on most machines. This sort of dimensionality of data is very
common with text workloads.
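As a quick sanity check on that figure (just arithmetic: 11,314 samples by 130,107 features at 8 bytes per float64 entry):
11314 * 130107 * 8 / 1e9   # roughly 11.8 GB for a dense float64 array of this shape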
The raw counts are, however, not ideal since common words such as “the” and
“and” will dominate the counts for most documents, while contributing
very little information about the actual content of the document. We can
correct for this by using a TfidfTransformer
from sklearn, which
will convert the data into TF-IDF
format. There are lots
of ways to think about the transformation done by TF-IDF, but I like to
think of it intuitively as follows. The information content of a word
can be thought of as (roughly) proportional to the negative log of the
frequency of the word; the more often a word is used, the less
information it tends to carry, and infrequent words carry more
information. What TF-IDF is going to do can be thought of as akin to
re-weighting the columns according to the information content of the
word associated with that column. Thus the common words like “the” and
“and” will get down-weighted, as carrying less information about the
document, while infrequent words will be deemed more important and have
their associated columns up-weighted. We can apply this transformation
to both the train and test sets (using the same transformer trained on
the training set).
tfidf = sklearn.feature_extraction.text.TfidfTransformer(norm='l1').fit(news_train.data)
train_data = tfidf.transform(news_train.data)
test_data = tfidf.transform(news_test.data)
The result is still a sparse matrix, since TF-IDF doesn’t change the
zero elements at all, nor the number of features.
train_data
<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
with 1787565 stored elements in Compressed Sparse Row format>
Now we need to pass this very high dimensional data to UMAP. Unlike some
other non-linear dimension reduction techniques we don’t need to apply
PCA first to get the data down to a reasonable dimensionality; nor do we
need to use other techniques to reduce the data to be able to be
represented as a dense numpy
array; we can work directly on the
130,000 dimensional sparse matrix.
%%time
mapper = umap.UMAP(metric='hellinger', random_state=42).fit(train_data)
CPU times: user 8min 40s, sys: 3.07 s, total: 8min 44s
Wall time: 8min 43s
Now we can plot the results, with labels according to the target
variable of the data – which newsgroup the posting was drawn from.
umap.plot.points(mapper, labels=news_train.target)
We can see that even going directly from a 130,000 dimensional space
down to only 2 dimensions UMAP has done a decent job of separating out
many of the different newsgroups.
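If you want a rough quantitative feel for that separation (this check is not part of the original workflow, and the exact score will vary with the UMAP version and random state), you can cross-validate a simple k-nearest-neighbour classifier on the 2D embedding stored in mapper.embedding_:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=10)
cross_val_score(knn, mapper.embedding_, news_train.target, cv=5).mean()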
We can now attempt to add the test data to the same space using the
transform
method.
test_embedding = mapper.transform(test_data)
While this is somewhat expensive computationally, it does work, and we
can plot the end result:
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111)
plt.scatter(test_embedding[:, 0], test_embedding[:, 1], s=1, c=news_test.target, cmap='Spectral')
ax.set(xticks=[], yticks=[]);