To give you a better understanding of an app's overall presence and performance in the app store ecosystem, we felt it was important to provide a quick way to navigate to an app's App Profile pages in either the iTunes or Google Play store.
In case you were wondering how these associations were made, you're in luck! This blog post is dedicated to the technical challenges we encountered and the processes we put in place to bring this seemingly inconspicuous feature to life.
To us humans, associating the same app across the two stores seems dead simple. For example, let's consider Pandora Radio's mobile apps. You can check out the iOS App Profile page and Android App Profile page for reference, but the attributes we will be examining are listed below:
When placed side-by-side, it is pretty obvious to us that these two apps are associated with each other. This is because there are many things that we take for granted when we make the association in our minds.
However, when we have to create a defined set of rules for a computer to follow, the task actually becomes quite complex. Here is how we figured out an automated process for matching apps between the two stores.
To get a better understanding of how matching works, let's first look at the overall decision-making process. Then we will get into the three specific steps we took to determine app associations.
In a perfect world, corresponding apps would have the exact same name in the iTunes and Google Play stores. Unfortunately, impostor apps like these tend to clutter the selection process (the impostor shown has since been removed from the Google Play store).
So exact name matching is not enough.
To find all reasonable candidates for association, we search our database for the first 100 apps (capped to limit query time) that share any words with the base app (the app we're trying to match). Usually this step alone is enough to ensure that the best match lands in the candidate pool; however, common words such as "line", "flappy", or "walk" match 1,000+ apps.
To increase the chances of the correct app being among the 100 selected, we also search for bigrams from the base app's title. With an app like Pandora Radio this step adds little, since the name is so short, but with a longer app name like Star Walk™ - 5 Stars Astronomy Guide (matched to Star Walk - Astronomy Guide), it is definitely useful.
As you can see, a strict exact name match would not select the correct astronomy apps for candidate pooling. By breaking down the app name into bigrams ("star walk", "walk 5", "5 stars", "stars astronomy", and "astronomy guide") and searching for apps that contain any of those pairings, we get good results from both "star walk" and "astronomy guide," meaning that the correct app would be within the candidate pool that is passed on to the next step in the association process: text comparison.
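To make the pooling step concrete, here is a minimal Ruby sketch. The tokenizing rules are simplified, and `App.matching_any_term` is a hypothetical query helper standing in for our real search:

```ruby
# Tokenize a title into lowercase words; punctuation and symbols
# such as the "™" in "Star Walk™" are dropped.
def tokens(title)
  title.downcase.gsub(/[^a-z0-9\s]/, " ").split
end

# Adjacent word pairs, used to narrow searches for longer titles.
def bigrams(title)
  tokens(title).each_cons(2).map { |pair| pair.join(" ") }
end

bigrams("Star Walk™ - 5 Stars Astronomy Guide")
# => ["star walk", "walk 5", "5 stars", "stars astronomy", "astronomy guide"]

# Hypothetical query helper: up to 100 apps sharing any word or bigram.
terms = tokens(base_app.name) | bigrams(base_app.name)
candidates = App.matching_any_term(terms).limit(100)
```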
This next step has three distinct components: an app name check, a publisher name check, and a support URL check.
Each phase of the text comparison is scored based on how closely the apps match, which we'll discuss in more detail in each phase's explanation.
In the app name check, we first see whether the names of the base app and the candidate under consideration match exactly. If not, we standardize the two names (making everything lowercase, removing or translating symbols, removing everything after common separators, and removing spaces) and compare them again.
If neither equality check comes back positive, we check whether one name contains the other, which counts as a partial match.
Finally, if no equality or inclusion is found, the candidate is penalized with a negative score for its name check.
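Here's a minimal sketch of that cascade in Ruby. The separator list, symbol translations, and point values are illustrative stand-ins for our real rules:

```ruby
SEPARATORS      = /[-:|]/ # illustrative "common separators"
FULL_POINTS     = 2       # hypothetical score values
PARTIAL_POINTS  = 1
NEGATIVE_POINTS = -1

def standardize(name)
  name.downcase
      .split(SEPARATORS).first.to_s # drop everything after a separator
      .gsub("&", "and")             # translate symbols
      .gsub(/[^a-z0-9]/, "")        # strip remaining symbols and spaces
end

def name_score(base_name, candidate_name)
  return FULL_POINTS if base_name == candidate_name       # exact check
  a = standardize(base_name)
  b = standardize(candidate_name)
  return FULL_POINTS if a == b                            # standardized check
  return PARTIAL_POINTS if a.include?(b) || b.include?(a) # inclusion check
  NEGATIVE_POINTS                                         # no match: penalize
end
```

Here is how that cascade plays out on a few real app pairs: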
iTunes: Run with Map My Run - GPS Running, Jog, Walk, Workout Tracking and Calorie Counter
Google Play: Run with Map My Run
Exact check: Nope
Standardized check: Yes (runwithmapmyrun = runwithmapmyrun)
iTunes: The Weather Channel and weather.com - local forecasts, radar, and storm tracking
Google Play: The Weather Channel
Exact check: Nope
Standardized check: Nope (theweatherchannelandweather not equal to theweatherchannel)
Inclusion check: Yes (theweatherchannelandweather contains theweatherchannel)
iTunes: Real Estate by Zillow – Homes & Apartments, For Sale or Rent
Google Play: Zillow Real Estate & Rentals
Exact check: Nope
Standardized check: Nope (realestatebyzillow not equal to zillowrealestateandrentals)
Inclusion check: Nope (zillowrealestateandrentals does not contain realestatebyzillow)
The publisher name check step is almost identical to the app name check step. One detail of note: the standardization process used in both checks also removes common company suffixes such as LLC, LTD, S.à r.l., GmbH, and Inc., which is useful for associating companies whose publisher names differ between stores, such as King.com ("King.com Limited" in the iTunes store, "King.com" in the Google Play store).
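Continuing the sketch above, suffix stripping might look like this; the suffix list is only a sample:

```ruby
# A sample of the company suffixes stripped before comparison.
COMPANY_SUFFIXES = /\b(llc|ltd|limited|inc|gmbh|s\.?\s*a\.?\s*r\.?\s*l)\.?\s*\z/i

def standardize_publisher(name)
  standardize(name.sub(COMPANY_SUFFIXES, ""))
end

standardize_publisher("King.com Limited") # => "kingcom"
standardize_publisher("King.com")         # => "kingcom"
```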
Another strong indicator of association is the support URL domain of the base app and candidate app. To extract domains reliably from URLs that may contain any number of subdomains or unusual top-level domain conventions, we used a Ruby gem called Domainatrix.
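The accessor names below come from the gem's documentation; Domainatrix splits a URL into its registrable parts, so the extracted domain is stable even with nested subdomains or multi-part TLDs:

```ruby
require 'domainatrix'

url = Domainatrix.parse("http://help.support.example.co.uk/contact")
url.domain        # => "example"
url.subdomain     # => "help.support"
url.public_suffix # => "co.uk"
```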
For relatively unique URLs such as snapchat(.com), an exact match is a great indicator of association. However, what about apps with support URLs at facebook.com? How do you properly credit matching domains when 19,000+ apps across the iTunes and Google Play stores also use support pages hosted on facebook.com?
To address this problem, we compiled a blacklist of domains. Each domain on the list (facebook, google, youtube, twitter, github, apple, etc.) has at least 20 different apps using it as their listed support URL domain.
The blacklist is used by default. However, so that the actual Facebook apps can still get credit for their matching support URL domains, we ignore the blacklist when both the app name and publisher name checks return at least partial matches.
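Putting that together, the support URL check might look like the sketch below. The blacklist shown is only an excerpt, and the score constants come from the earlier (hypothetical) name check sketch:

```ruby
DOMAIN_BLACKLIST = %w[facebook google youtube twitter github apple]

def support_url_score(base_url, candidate_url, name_points, publisher_points)
  base_domain = Domainatrix.parse(base_url).domain
  cand_domain = Domainatrix.parse(candidate_url).domain
  return NEGATIVE_POINTS unless base_domain == cand_domain

  # A blacklisted domain only earns credit when both text checks
  # scored at least a partial match.
  both_partial = name_points >= PARTIAL_POINTS &&
                 publisher_points >= PARTIAL_POINTS
  return NEGATIVE_POINTS if DOMAIN_BLACKLIST.include?(base_domain) && !both_partial

  FULL_POINTS
end
```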
For example, consider Birthday Cards for Facebook. During the association process for Facebook (the app), this app would be in the initial text comparison candidate pool because its name contains the word "Facebook".
App Name - Partial points: "Birthday Cards for Facebook" technically does include "Facebook"
Publisher Name - Negative points: "Apps-O-Rama" is not equal to "Facebook"
Support URL - Negative points: https://apps.facebook.com/rybirthday/ - the "facebook" domain is on the domain blacklist, and since the app name and publisher name checks did not both score at least partial matches, the blacklist applies.
Contrast that with the actual Facebook app:
App Name - Full points: "Facebook" = "Facebook"
Publisher Name - Full points: "facebook" = "facebook" (standardized from Facebook, Inc.)
Support URL - Full points: http://www.facebook.com/mobile = https://www.facebook.com/facebook
Since the app name and publisher name checks both came back positive, the blacklist is ignored and the domain match can be properly credited.
The three scores from the app name, publisher name, and support URL checks are then summed. When a match candidate receives a perfect score (all three text comparisons return full points), image analysis is skipped and the app is associated.
When the highest score is anything less than perfect, the candidate(s) with the top score in the cohort are passed on to the final step: image analysis.
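In sketch form, the scoring and shortlisting step looks roughly like this (the base and candidate objects, with their name, publisher, and support_url attributes, are hypothetical):

```ruby
PERFECT_SCORE = FULL_POINTS * 3 # full points on all three checks

def total_score(base, candidate)
  n = name_score(base.name, candidate.name)
  # The publisher check reuses the same cascade, with company
  # suffixes stripped first (see standardize_publisher above).
  p = name_score(standardize_publisher(base.publisher),
                 standardize_publisher(candidate.publisher))
  u = support_url_score(base.support_url, candidate.support_url, n, p)
  n + p + u
end

def shortlist(base, candidates)
  scored = candidates.map { |c| [c, total_score(base, c)] }

  # A perfect score associates the app on the spot, skipping image analysis.
  perfect = scored.find { |_, pts| pts == PERFECT_SCORE }
  return [perfect.first] if perfect

  # Otherwise only the top scorer(s) proceed to icon comparison.
  best = scored.map(&:last).max
  scored.select { |_, pts| pts == best }.map(&:first)
end
```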
To detect the degree of similarity between the base app's and candidate app's icons, we needed a hashing algorithm that would not produce an avalanche effect on minute differences, so MD5, SHA-1, and other cryptographic hashing algorithms were out of the question; by design, they produce significantly different output even when the input changes very little.
Therefore, we ended up using the pHash library (more specifically, the Phashion gem created by Mike Perham), which generates a perceptual hash of an image for comparison.
A “perceptual hash” is a 64-bit value based on the discrete cosine transform of the image’s frequency spectrum data. Similar images will have hashes that are close in terms of Hamming distance. That is, a binary hash value of 1000 is closer to 0000 than 0011 because it only has one bit different whereas the latter value has two bits different. The duplicate threshold defines how many bits must be different between two hashes for the two associated images to be considered different images. Our testing showed that 15 bits is a good value to start with, it detected all duplicates with a minimum of false positives. - Mike Perham
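To make the quoted example concrete: the Hamming distance between two hashes is just the number of set bits in their XOR.

```ruby
# XOR the two hashes, then count the differing (set) bits.
def hamming_distance(h1, h2)
  (h1 ^ h2).to_s(2).count("1")
end

hamming_distance(0b1000, 0b0000) # => 1 (one bit differs)
hamming_distance(0b0011, 0b0000) # => 2 (two bits differ)
```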
So images that have a smaller Hamming distance are more similar. Before comparing the Hamming distance between the generated perceptual hashes, we first make some standardizing modifications to the images using ImageMagick. This is done to ensure that the images being compared are as similar as possible and to make up for any size differences. The checkerboard represents a transparent background.
Without making these standardizing modifications, the original images would have had a Hamming distance of 30. After the modifications, the Hamming distance is 12, which is under the recommended Hamming distance threshold.
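End to end, the icon comparison looks roughly like the sketch below. The ImageMagick convert flags are placeholders for our actual normalization settings, and the Phashion calls (Image#fingerprint, Image#duplicate?) follow the gem's README:

```ruby
require 'phashion'

# Flatten transparency onto a solid background and force a common
# size before hashing; the flags here are illustrative only.
def normalize_icon(src, dst)
  system("convert", src, "-background", "white", "-flatten",
         "-resize", "64x64!", dst) or raise "convert failed: #{src}"
end

normalize_icon("base_icon.png", "base_norm.png")
normalize_icon("candidate_icon.png", "candidate_norm.png")

base      = Phashion::Image.new("base_norm.png")
candidate = Phashion::Image.new("candidate_norm.png")

hamming_distance(base.fingerprint, candidate.fingerprint) # e.g. 12
base.duplicate?(candidate) # true when under the 15-bit threshold
```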
From the highest-scoring candidates passed into image analysis from text comparison, we finally associate the base app with the candidate whose icon is most similar, provided its Hamming distance falls below the threshold.
This process is still undergoing refinement, but it has had pretty good performance so far.
We considered other approaches, such as a weighted word-frequency analysis of the title, publisher name, and description, but ultimately the weights were difficult to calibrate to maximize accuracy.
For the image analysis step, we also considered comparing grayscale and inverted versions of each icon to handle cases like YouTube on iOS/Android (thankfully, YouTube was matched with a perfect text analysis score), but we felt it opened up too many possibilities for false positives in icon comparisons we had not closely examined.
We also considered comparing total user rating counts between candidate apps, but realized this would fail for apps recently released in one store, since it can take weeks or months for the user bases to catch up to each other.
All of the associations you see on Sensor Tower originated from this process (except for 6 that were done manually). If you have any suggestions for improving this process, feel free to email me at myron@sensortower.com. Also if you think you can do a better job, we're hiring!