Attack of the Clones: Detecting Cloned Applications on Android Markets

Jonathan Crussell (1,2), Clint Gibler (1), and Hao Chen (1)
(1) University of California, Davis
(2) Sandia National Labs
Source: ESORICS 2012

Outline
Introduction
Background
Threat Model
Clone Detection Approaches and Related Work
Methodology
Evaluation
Case Studies
Discussion
Conclusion

Introduction
Much of the user experience of Android relies on third-party apps. Android has numerous marketplaces. Protect users from malicious apps. Protect developers from plagiarists.

Introduction
Developers can charge directly for their apps, or offer free apps that are ad-supported or contain in-game billing.
Some apps have two versions.
Paid apps are cracked and released for free.
Free apps are cloned and their ad libraries changed.

Background
Android Markets
Android Application Structure

Threat Model - Definition of Clone

Clones occur when two applications have similar code but different ownership.
We ignore:
Third-party libraries.
Multiple versions of the same application, if they have the same ownership.

Resistance to Evasion Techniques

High-level modifications
Method restructurings
Control flow alterations
Addition/deletion
Reordering

Non Goals
Find cloning in native code. Determine which applications are the victims and which are clones.

Clone Detection Approaches - Feature Based

Feature-based approaches analyze a program and extract a set of features.
Features range from the number or size of classes, methods, loops, or variables to the included libraries.
These approaches tend to have a low detection rate or a high false positive rate.

Clone Detection Approaches - Structure Based

Structure-based systems convert programs into a stream of tokens and then compare the streams between two programs.
They are more robust than feature-based systems.
Examples: JPLAG, Winnowing, and MOSS.
Comparing DEX bytecode streams could be a quick and scalable method to find exactly or near-exactly copied code.
But bytecode streams contain no higher-level semantic knowledge about the code.
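To make the token-stream comparison concrete, here is a minimal sketch of winnowing-style fingerprinting. It is not the implementation used by JPLAG, MOSS, or any Android tool; the function names, the CRC32 hash, and the default values of k and w are assumptions chosen for illustration.

```python
import zlib

def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens (CRC32 is an illustrative choice)."""
    return [zlib.crc32(" ".join(tokens[i:i + k]).encode())
            for i in range(len(tokens) - k + 1)]

def winnow(hashes, w):
    """Keep the minimum hash from each window of w consecutive hashes."""
    if len(hashes) <= w:
        return {min(hashes)} if hashes else set()
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def fingerprint_similarity(tokens_a, tokens_b, k=3, w=2):
    """Jaccard similarity of the two fingerprint sets (k and w are hypothetical defaults)."""
    fa = winnow(kgram_hashes(tokens_a, k), w)
    fb = winnow(kgram_hashes(tokens_b, k), w)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)

# Token names below are made up; a real system would tokenize DEX bytecode.
original = ["const", "add", "mul", "invoke", "return", "const"]
copied = list(original)
print(fingerprint_similarity(original, copied))
```

The score depends only on which token sequences appear, which is why this style of comparison carries no knowledge of what the code means.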

Clone Detection Approaches - PDG Based

Program Dependence Graph (PDG):
Each node is a statement.
Each edge shows a dependency between statements.
Two types of dependencies: data and control.
A data dependency edge between statements 1 and 2 exists if there is a variable in 2 whose value depends on 1.
A control dependency between two statements exists if the truth value of the first statement controls whether the second statement executes.
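As an illustration of data dependency edges, the sketch below builds them for straight-line code only. The statement encoding (a defined variable plus a list of used variables) is an assumption made for this example, and control dependencies are not modeled.

```python
def data_dependency_edges(statements):
    """Return edges (i, j) where statement j uses a variable last defined by statement i.

    statements: list of (defined_variable_or_None, [used_variables]),
    a hypothetical encoding for straight-line code.
    """
    last_def = {}   # variable name -> index of its most recent definition
    edges = set()
    for j, (defined, used) in enumerate(statements):
        for var in used:
            if var in last_def:
                edges.add((last_def[var], j))
        if defined is not None:
            last_def[defined] = j
    return edges

# 0: a = read()   1: b = 2   2: c = a + b   3: print(c)
program = [("a", []), ("b", []), ("c", ["a", "b"]), (None, ["c"])]
print(sorted(data_dependency_edges(program)))  # [(0, 2), (1, 2), (2, 3)]
```

Swapping statements 0 and 1, which do not depend on each other, yields the same edge set, which hints at why dependency graphs tolerate statement reordering.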

Related Work
Androguard, DEXCD and DroidMOSS. All these approaches are structure based or structure based approximations. None of these tools use any semantic information to aid in detecting plagiarism.

Methodology

Selecting Potentially Cloned Applications

The goal of an application plagiarist is to entice unwary users to choose her cloned application instead of the original.
The clone's name and description are therefore likely to resemble the original's.

Determining Application Similarity Based on Attributes


We use Solr to mimic the search engines on Android markets. Attributes of the apps: name, package, market, owner, and description

Constructing PDGs
dex2jar: Convert both apps' code from the DEX format to a JAR.
WALA: Construct PDGs for each method in every class of the applications.
Only data dependency edges are used: they are more robust against statement reordering, insertion, and deletion.

Comparing PDGs - Excluding Common Libraries

Common libraries include ad libraries such as AdMob, the Facebook API, etc.
We dumped both the package name and SHA-1 hash of known library files and recorded the most frequent SHA-1 hashes for each library.
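The hash-based library exclusion can be sketched as follows. This is a minimal illustration, not DNADroid's code: the function names, the `top_n` parameter, and the in-memory dictionaries standing in for files are all assumptions.

```python
import hashlib
from collections import Counter, defaultdict

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def most_frequent_hashes(observations, top_n=1):
    """observations: iterable of (library_package, file_bytes) seen across many apps.

    Returns the top_n most frequent SHA-1 hashes per library (top_n is hypothetical).
    """
    counts = defaultdict(Counter)
    for package, data in observations:
        counts[package][sha1_hex(data)] += 1
    return {h for c in counts.values() for h, _ in c.most_common(top_n)}

def drop_known_library_files(app_files, known_hashes):
    """app_files: dict of path -> bytes. Keep only files whose hash is not a known library hash."""
    return {path: data for path, data in app_files.items()
            if sha1_hex(data) not in known_hashes}

# Made-up byte strings stand in for class files.
seen = [("com.admob", b"v1"), ("com.admob", b"v1"), ("com.admob", b"v2")]
known = most_frequent_hashes(seen)
app = {"com/admob/Ad.class": b"v1", "com/example/Main.class": b"main"}
print(sorted(drop_known_library_files(app, known)))  # ['com/example/Main.class']
```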

Lossless and Lossy Filters

Lossless filter: removes PDGs from consideration that are smaller than a specified size (< 10 nodes).
Lossy filter: calculates a frequency vector for each of the methods in the pair. This vector counts how many times each node type occurs in the PDG. The two vectors are compared using hypothesis testing (G-test).
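A rough sketch of the two filters follows. The size cutoff of 10 nodes comes from the slide; everything else is an assumption: the G statistic is computed here as a two-sample test of homogeneity over the node-type counts, and the acceptance threshold is a hypothetical parameter, since the slides do not say which form of the test or which significance level is used.

```python
import math
from collections import Counter

MIN_NODES = 10  # lossless filter: PDGs with fewer than 10 nodes are dropped

def passes_lossless(node_types):
    """node_types: list with one node-type label per PDG node."""
    return len(node_types) >= MIN_NODES

def g_statistic(types_a, types_b):
    """G = 2 * sum(O * ln(O / E)) over the 2 x k table of node-type counts."""
    ca, cb = Counter(types_a), Counter(types_b)
    na, nb = sum(ca.values()), sum(cb.values())
    total = na + nb
    g = 0.0
    for t in set(ca) | set(cb):
        column = ca[t] + cb[t]
        for observed, row in ((ca[t], na), (cb[t], nb)):
            if observed:  # a zero count contributes nothing to the sum
                expected = row * column / total
                g += 2.0 * observed * math.log(observed / expected)
    return g

def passes_lossy(types_a, types_b, threshold):
    """Keep the pair only if the two frequency vectors look alike (threshold is hypothetical)."""
    return g_statistic(types_a, types_b) <= threshold

print(g_statistic(["add"] * 10, ["call"] * 10))  # large: the methods share no node types
```

Identical frequency vectors give G = 0, so a pair of copied methods passes any non-negative threshold, while the filter can wrongly discard true matches, which is what makes it lossy.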

Subgraph Isomorphism
Find a mapping between the nodes of one method's PDG and the nodes of the other method's PDG.
Subgraph isomorphism is NP-complete.
We use the VF2 algorithm.
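To show what such a node mapping looks like, here is a plain backtracking matcher. It is deliberately not VF2 (which adds pruning rules to the same search), and it checks only that edges of the smaller graph are preserved. The graph encoding and node-type labels are assumptions for illustration.

```python
def find_subgraph_mapping(small_nodes, small_edges, big_nodes, big_edges):
    """Map every node of the small graph to a distinct, same-typed node of the big graph
    so that every small edge lands on a big edge. Returns a dict or None.

    small_nodes / big_nodes: dict of node id -> node type
    small_edges / big_edges: set of (source, target) pairs
    """
    order = list(small_nodes)

    def consistent(mapping, u, v):
        # Every small edge touching u whose other end is already mapped must exist in big.
        for a, b in small_edges:
            if a == u and b in mapping and (v, mapping[b]) not in big_edges:
                return False
            if b == u and a in mapping and (mapping[a], v) not in big_edges:
                return False
        return True

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        u = order[len(mapping)]
        used = set(mapping.values())
        for v, v_type in big_nodes.items():
            if v in used or v_type != small_nodes[u]:
                continue
            mapping[u] = v
            if consistent(mapping, u, v):
                result = extend(mapping)
                if result is not None:
                    return result
            del mapping[u]
        return None

    return extend({})

small = ({0: "load", 1: "add"}, {(0, 1)})
big = ({"x": "load", "y": "const", "z": "add"}, {("x", "z"), ("y", "z")})
print(find_subgraph_mapping(*small, *big))  # {0: 'x', 1: 'z'}
```

The search is exponential in the worst case, which is why cheap filters are applied before any pair of PDGs reaches this step.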

Computing Similarity Scores

For each method $f$ (excluding the methods in known libraries) in application $A$, let $|f|$ be the number of nodes in this method's PDG.
Find the best match of this PDG among application $B$'s PDGs and denote it as $m(f)$.
Similarity score:

$$\mathrm{sim}_A(B) = \frac{\sum_{f \in A} |m(f)|}{\sum_{f \in A} |f|}$$
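The similarity score is a ratio of matched PDG nodes to total PDG nodes, which a short sketch makes concrete. The dictionaries and method names below are hypothetical inputs; computing the best matches themselves is assumed to have happened already.

```python
def similarity_score(method_sizes, best_match_sizes):
    """Fraction of one application's PDG nodes covered by best matches in the other.

    method_sizes: dict of method name -> number of nodes in its PDG
                  (library methods already excluded)
    best_match_sizes: dict of method name -> size of its best match in the
                      other application (absent means no match)
    """
    total = sum(method_sizes.values())
    if total == 0:
        return 0.0
    matched = sum(min(best_match_sizes.get(name, 0), size)
                  for name, size in method_sizes.items())
    return matched / total

# Made-up sizes: 25 of 40 nodes are matched.
sizes = {"onCreate": 20, "loadAd": 10, "helper": 10}
matches = {"onCreate": 20, "loadAd": 5}
print(similarity_score(sizes, matches))  # 0.625
```

Because the denominator counts only the first application's nodes, swapping the two applications can give a different score.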

Evaluation
75,000 free apps from 13 Android markets.
Randomly selected 9,400 pairs from the potential clones.
Hadoop: parallelizes DNADroid.
HDFS: shares data across a small cluster.
The average throughput of DNADroid on this small cluster is 0.71 application pairs per minute.

Similarity between Applications

Clustering Cloned Applications

Filter Performance

Visual and Behavioral Verification

Case Studies

Benign Cloning
DNADroid found 30 pairs in which both applications have a 100% similarity score.
Example: translation of an app.

Changes to Advertising Libraries


We can see when an application has most likely been cloned for monetary gain.
Example: XWind Downloader.
For the 141 cloned apps, 91 (65%) of the pairs had different libraries, and every one of those included changes to advertising libraries.

Malware Added to an Application


HippoSMS is a malicious application that requires 10 permissions.
It shares the same package name as a Chinese video player that requires 11 permissions.
HippoSMS requires 6 permissions that the video player doesn't use.

Two Variants of the Same Malware


Two malicious apps that are identified by VirusTotal as being variants of the BaseBridge malware family. Both applications have been stripped of meaningful class and method names. DNADroid found coverages of 35% and 28% between the two variants.

Use of Freeware Cracking Tool in the Wild


AntiLVL:
Decompiles an app with baksmali.
Inserts a new file: SmaliHook.class.
Hides AntiLVL's modifications from the app itself by returning the original file size, MD5, and signatures.
Targets: Android License Verification Library (LVL), Amazon Appstore DRM, and Verizon DRM.
189 of 310 applications contain SmaliHook.class.
235 of 310 contain references to AntiLVL in their signature files.
Only 8% of our total apps were acquired from Chinese markets, but 88% of the apps with AntiLVL traces were from Chinese markets.

Discussion

False Positive
Since it is a serious allegation to claim an application is a clone, we design DNADroid to have a very low false positive rate.

False Negative
Candidate selection assumes that cloned applications have attributes similar to the original's; clones with dissimilar attributes are never compared.
There exist advanced program transformations that can evade PDG-based clone detection.

Comparison to Other Approaches


Androguard: misses 18%.
DEXCD: had problems running on the pairs DNADroid identified.
DroidMOSS: not currently publicly available.

Performance
DNADroid is more expensive than the other approaches but results in fewer false positives and false negatives.

Conclusion
DNADroid is a tool for finding clones on a large scale.
We evaluated DNADroid on applications crawled from 13 Android markets.
Identified at least 141 apps that have been cloned.
Identified an additional 310 apps that were cracked with AntiLVL.
We describe five case studies.
DNADroid has a very low false positive rate.
DNADroid is an effective tool.
