Machine Learning for Mobile Venture Analytics

The Fayrix Machine Learning solution helps mobile marketers and analysts identify niches and trends on the App Store, and estimate the potential market size and chances of success for a mobile app.
Too many apps is an issue.
The growing mobile app market is expected to reach $189B by 2020. The App Store alone hosts over 3 million apps, 783,000 of which are games; the rest are non-gaming applications. While games are divided into 19 categories (about 41,000 games in each, on average), the 2.3 million non-gaming apps are split into just 24 categories (almost 96,000 apps per category).
User's issue: too many apps, too few categories.
On the one hand, developers often have to release their apps in categories that don't precisely match their app's functionality. On the other hand, users have to adapt their needs to a limited set of categories that hasn't changed since the advent of the App Store. This makes it much more difficult to find the app you need. App Store search doesn't help either, for a number of reasons: it is based solely on the app name and keywords (no more than 100 characters) that the developer chose before publishing the app. If you think about how a person discovers something new, it's clear that a brand-new app in a novel niche with no analogs has little chance of being found, because a person's needs are always tied to something familiar. Someone who knows nothing about airplanes and air travel will never consider flying.
Developer's issue: a lot of resources have to be invested in promoting a brand-new product.
Mobile apps are often VC-backed projects, and the conservatism of the app stores creates problems for investment activity. As of now, investors have no tool to objectively analyze and identify mobile trends at the early stage, when projects begin a breakthrough in the mobile sphere.
Investor's problem: it is impossible to systematically find fresh, innovative projects among 3 million apps.
Our solution is to distribute apps into objective groups that reflect their actual functionality and features. Our source data is app descriptions, since that is where developers present their idea and the benefits of their product: about 20,000 descriptions of iOS apps from 20 categories of the US Apple App Store. Our goal is to identify objective app categories and determine their optimal number.

The first step in description analysis is preprocessing. All stop words, low-value characters, and most punctuation marks should be removed from the description texts. These stop words include both general words unrelated to the subject (prepositions, participles, numerals, interjections, and particles) and domain-specific words (app, http, www, email, …). The remaining words should be reduced to their base form using lemmatization and stemming algorithms.
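As an illustrative sketch (the stop-word list and the sample description below are our own assumptions, not the production pipeline), the cleanup step could look like this:

```python
import re

# Illustrative stop-word list: general words plus domain-specific ones
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "with",
              "app", "http", "www", "email"}

def preprocess(description: str) -> list[str]:
    # Lowercase, keep alphabetic tokens only (drops digits and punctuation),
    # then remove stop words and very short tokens
    tokens = re.findall(r"[a-z]+", description.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

print(preprocess("Learn 5 new words a day with the WordApp! Visit http://wordapp.example"))
# ['learn', 'new', 'words', 'day', 'wordapp', 'visit', 'wordapp', 'example']
```

A production pipeline would additionally lemmatize or stem the remaining tokens, e.g. with NLTK's WordNetLemmatizer or SnowballStemmer.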
To solve this problem, we have to understand which of the apps already published on the app stores are similar to each other. Our goal is to form clusters of apps, so we need to decide how to cluster app descriptions.

Clustering algorithms like k-means don't work well with raw text; they require an array of numerical vectors as input. Therefore, we have to turn our texts into such an array. Let's look at the methods that can help with that.

TF-IDF (short for term frequency-inverse document frequency) is a statistical measure used to evaluate the significance of a specific word in the context of each app description.

TF (short for term frequency) is the ratio of the number of occurrences of a specific word to the total number of words in the document. This indicator shows the importance of the word within a single app description.

IDF (inverse document frequency) is the inverse of the frequency with which a specific word appears across the documents of the text corpus. This indicator decreases the weight of the most frequently used words (prepositions, conjunctions, and common terms).

The TF-IDF score of a word is higher when the word appears frequently in a given document but rarely in other documents.

The result of the algorithm is a vector of length N, where N is the number of unique words in the text corpus. For a typical corpus, that is roughly 10,000-50,000 words. For further analysis, we'll need to reduce this length.
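As a sketch, the TF-IDF matrix can be built with scikit-learn's TfidfVectorizer (the three toy descriptions below are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "learn spanish vocabulary with daily lessons",
    "translate spanish text with offline dictionary",
    "track your daily running distance and pace",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(descriptions)  # sparse matrix: (n_docs, n_unique_terms)

# "spanish" appears in two documents, so its IDF (and thus its weight) is
# lower than that of a word unique to one document, such as "translate"
print(X.shape)
```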

Word-embedding algorithms such as Word2Vec, GloVe, and FastText transform a text corpus into a vector space. The dimensionality of that space is no more than a few hundred.

To simplify, the method can be described as follows: the algorithm tries to predict the current word from the nearby words. As a result, the vectors of related words end up close to each other in the vector space. For example, the words closest to king are queen and princess.

To obtain a vector for the entire document (i.e., the app description), the word vectors are usually summed. This has an evident disadvantage: the resulting vector is noised by word vectors unrelated to the meaning of the text. Word-to-vector transformation works best for single words, reasonably well for word combinations, and poorly for phrases; for long documents this method doesn't work at all. To improve the vector representation of texts, we need to thoroughly clean the text of insignificant words, or use a weighted sum of vectors, though the choice of the weighting algorithm is not always obvious.
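A toy sketch of the difference between a plain and a weighted sum (the 4-dimensional "word vectors" and the IDF-style weights are invented for illustration; real embeddings have hundreds of dimensions):

```python
import numpy as np

# Invented stand-ins for Word2Vec output
vectors = {
    "learn":   np.array([0.9, 0.1, 0.0, 0.0]),
    "spanish": np.array([0.8, 0.2, 0.1, 0.0]),
    "free":    np.array([0.0, 0.0, 0.9, 0.1]),  # generic word unrelated to the topic
}
# Hypothetical weights (e.g., IDF) that down-weight the generic word
weights = {"learn": 1.0, "spanish": 1.2, "free": 0.1}

doc = ["learn", "spanish", "free"]
plain_sum = sum(vectors[w] for w in doc)
weighted = sum(weights[w] * vectors[w] for w in doc) / sum(weights[w] for w in doc)

print(plain_sum)  # noised by "free"
print(weighted)   # dominated by the topical words
```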

Another disadvantage of this method is that the quality of the model depends on the text corpus used for training. For the Russian language, successful training requires a rather large themed text corpus, usually over 10 million words.

Auto-encoder based on a neural network

An auto-encoder is an unsupervised machine learning algorithm that uses a neural network trained so that an input vector of values X produces a network output Y equal to the input (Y = X).

Auto-encoder example:
In addition, the network architecture has the following constraints:
  • The number of neurons in the hidden layer L2 must be smaller than the size of the input data (see diagram);
  • The activation of the hidden-layer neurons should be sparse (the number of inactive hidden neurons should be far greater than the number of active ones); usually no more than 5% of the neurons are active.
The advantage of this method is that the obtained vector (usually the output of the hidden layer L2) conveys the meaning of the input text rather accurately. The disadvantage is that training requires a large, high-quality text corpus.
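One way to sketch the idea with off-the-shelf tools is to train scikit-learn's MLPRegressor to reproduce its own input and read the narrow hidden layer as the document embedding (the random data here is purely illustrative; a real auto-encoder would normally be built in a deep-learning framework):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 20))  # stand-in for 20-dimensional document vectors

# Train the network to reproduce its input (Y = X); the narrow hidden
# layer is forced to learn a compressed representation
autoencoder = MLPRegressor(hidden_layer_sizes=(5,), activation="relu",
                           max_iter=500, random_state=0)
autoencoder.fit(X, X)

# The embedding is the hidden-layer output, computed from the learned weights
hidden = np.maximum(0, X @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])
print(hidden.shape)  # (200, 5)
```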

Analyzing the above methods, we concluded that none of them provides sufficient accuracy. We eventually applied a somewhat different method, whose description is beyond the scope of this article.

Unfortunately, the resulting vector still contains several hundred values. Clustering vectors of this size would take a long time even with parallelized algorithms, so we need to find ways of reducing the size of the data.

Reducing data size with methods from machine learning and data mining

Here are a few methods commonly used in data mining to reduce dataset size. Let's take a look at them and compare.

One way to reduce the size of a dataset with minimal information loss is principal component analysis (PCA). By selecting optimal projections, the algorithm eliminates redundancy and correlation in the array of input vectors, yielding a set of significant, uncorrelated components. The main disadvantage of this method is that the projections are computed from the covariance matrix, whose size is proportional to the dimensionality of the input data, so for very large datasets finding the eigenvectors may be infeasible.
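A minimal PCA sketch with scikit-learn (random data stands in for the real description vectors):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((1000, 300))  # e.g., 1,000 description vectors of length 300

pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)

# The components are uncorrelated and ordered by explained variance
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))
```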

Singular value decomposition (SVD) is used to factorize the matrix of text features. The input matrix is decomposed into several components whose physical meaning is a sequence of linear operators: rotations and scalings of the input vectors. The components of the decomposition capture the geometric changes involved in modifying the size of the vector space.
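Unlike PCA, scikit-learn's TruncatedSVD works directly on a sparse term matrix without densifying it, which is why it is popular for text features (the random sparse matrix below is illustrative):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Sparse stand-in for a TF-IDF matrix: 1,000 documents, 20,000 terms
X = sparse_random(1000, 20000, density=0.001, random_state=0)

svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (1000, 100)
```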

The Kohonen self-organizing map (SOM) is a type of neural network trained with unsupervised learning. Its main goal is to find hidden patterns in the dataset by reducing the dimensionality of the input vector space.
A significant feature of such maps is that they visualize the reduced space while preserving the topology of the input space: the obtained projections can be compared with one another, the distances between them can be measured, etc.
The advantages of SOMs are their robustness to noisy data and fast training; the disadvantage is that the final result depends on the initial network settings.
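A minimal self-organizing map can be sketched in a few lines of NumPy (the grid size, learning rate, and neighborhood width are arbitrary choices; in practice one would use a dedicated library and decay both parameters over time):

```python
import numpy as np

def train_som(X, grid, epochs=5, lr=0.5, sigma=2.0):
    rows, cols, dim = grid.shape
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for _ in range(epochs):
        for x in X:
            # Best-matching unit: the node whose weights are closest to x
            d = np.linalg.norm(grid - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Pull the BMU and its grid neighbours toward x (topology-preserving)
            dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
            grid += lr * h * (x - grid)
    return grid

rng = np.random.default_rng(0)
X = rng.random((500, 10))  # input vectors
grid = train_som(X, rng.random((8, 8, 10)))
print(grid.shape)  # (8, 8, 10)
```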

t-SNE is a popular dimensionality-reduction algorithm. It can reduce hundreds of dimensions to just two while preserving the significant relationships in the data: the closer objects are in the input space, the shorter the distance between them in the reduced space. t-SNE works well on small and medium-sized real-world datasets and doesn't require much hyperparameter tuning.
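A minimal t-SNE sketch with scikit-learn (random data stands in for the real document vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((300, 100))  # stand-in for the document vectors

# perplexity is the main knob; values of 5-50 are the usual range
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)  # (300, 2)
```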
In our case, t-SNE algorithm showed the best results for reducing the size of the vector.
The result of running t-SNE algorithm is presented in the following diagram:
The projections of the app description vectors in the space created with t-SNE
As a result, distinct groups of apps are clearly visible in the diagram. Now it's time to cluster the apps. But how many clusters should we distinguish: 30, 50, or 100?

Unfortunately, cluster analysis still offers no definitive solution to this problem, and we have to make the decision based on subjective factors. If the number of clusters is too small, we lose selectivity; if it's too large, we lose the ability to generalize. So we need to find a balanced solution.

Before clustering, we need to find an optimal number of clusters

In cluster analysis, there are several approximate methods for finding this number. One of them, arguably the clearest and most intuitive, is the elbow method. It involves plotting the dependence of the clustering quality (within-cluster dispersion) on the number of clusters k.
Initially, adding a new cluster helps the model substantially, but at some point increasing the number of clusters stops improving the quality of the clustering; that point indicates the optimal number of clusters. However, the model should still be checked visually. In our case, as seen in the graph above, the optimal value of k is 40-50, but after inspecting the obtained clusters, many apps with completely different features turned out to be in the same cluster. That's why we decided to increase k to 100.
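A sketch of the elbow method on synthetic data (three Gaussian blobs stand in for the t-SNE projections):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs, so the "true" k is 3
X = np.vstack([rng.normal(c, 0.3, size=(100, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# Inertia drops sharply up to k = 3, then flattens: that flattening is the elbow
drops = [round(inertias[i] - inertias[i + 1], 1) for i in range(len(inertias) - 1)]
print(drops)
```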

K-means algorithm in action

Now we conduct clustering using the k-means algorithm with k = 100 and examine the app descriptions within the obtained clusters.
Clustering t-SNE projections
In the diagram, we can see colored, numbered cluster groups. What is immediately noticeable? Some clusters are detached from the others and have no "neighbors", while some share a clear border with neighboring clusters. For example, in the highlighted area we can see clusters 0 and 60 dividing a large group of apps in half. Our hypothesis is that in this case the clustering has produced two objective categories.

Let's take a look at which apps ended up in these two clusters and try to understand whether we reached our goal. First, we can draw a bar chart of the market categories of the apps in clusters 0 and 60.
Bar chart: distribution of market categories in clusters 0 and 60
Now let's look at a few apps and top 50 words from app descriptions in each of the clusters.

Apps in cluster 0:
Babbel – Learn Italian
Innovative 101 Learn Languages
Learn Danish: Language Course
Page: English Grammar Checker
Oxford Spanish Dictionary

word phrase learn language speak english vocabulary dictionary spanish lessons native grammar audio babbel sentence speech pronunciation travel sign search help chinese course practice voice nemo translate study rhyme write italian term phrasebook hear time french speakers asl feature odyssey korean categories like record review spell read period text japanese

Apps in cluster 60:
Live Translator HQ
Offline Translator Pro 8 lang
Translator with Speech Pro
Spanish Dictionary +
Arabic Dictionary Elite

word english translate dictionary languages text language translation phrase spanish translator voice french chinese speech translations pronunciation japanese portuguese italian german speak russian arabic offline sentence search learn korean dutch support danish polish turkish dictionaries swedish norwegian finnish czech greek definitions hebrew romanian hindi vocabulary internet thai hungarian connection catalan

If it's hard to tell from the top 50 words alone what kind of apps these are, the app names make it easy to see that cluster 0 comprises apps primarily designed for learning foreign languages, while cluster 60 mostly consists of translators.

To sum up, if we take the top three market categories from the cluster's bar chart, we obtain the name of the objective pseudo-category relevant to this cluster. For instance, it can look like this:

cluster 0 - education_travel_reference
cluster 60 - reference_travel_books

The first is the primary market category; the latter two are clarifying categories that add semantic detail to the primary one.

To evaluate the quality of the clustering and check how market categories are distributed across clusters, there's one more visualization method: the heatmap. It shows how, and in what proportions, market categories are distributed among clusters. Moreover, the heatmap reveals the hierarchy of clusters, i.e., how clusters are related to each other.
Here's what it looks like in our case.
Here we can see independent clusters, as well as groups of tightly connected ones. Cluster groups feature a primary market category within which apps may have different functionality. For example, in the bottom right corner, the Entertainment group has additional clarifying categories: Photo & Video, Shopping, Music, and Lifestyle.
By applying different machine-learning-based methods of analysis, we created a tool that enables systematic monitoring of the mobile market and its trends, and identifies brand-new apps that cannot be assigned to any existing category on the app stores.

The suggested clustering model was developed in partnership with Robolitica and will be used to analyze mobile trends.