Graduate and Postdoctoral Studies
Mining Natural APIs from Large Code Corpora using a Mixture of Hidden Markov Models
Monday, May 1, 2017
to 12:15 PM
DH 1049 Duncan Hall
A Natural API is a collection of API methods that tend to be used according to discernible statistical patterns in real-world code. In this thesis, I present a method for learning an interpretable statistical model of such natural APIs. The model is trained on sequences of API calls extracted from large software repositories through program analysis. Once trained, it can recognize complex temporal dependencies between methods, including methods that technically belong to different APIs, and can be used as a proxy for formal correctness specifications.
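As a rough illustration of how a mixture of hidden Markov models can score a sequence of API calls, here is a minimal Python sketch. It is not the model from the thesis: the call names (`open`, `read`, `close`), the two hand-written component HMMs, and the mixture weights are all hypothetical, and training from a corpus is not shown. Each component assigns a likelihood to the sequence with the standard forward algorithm, and the mixture takes the weighted sum of those likelihoods.

```python
import math

def forward_log_likelihood(seq, start, trans, emit):
    """Log-likelihood of a call sequence under one HMM (scaled forward algorithm)."""
    if not seq:
        return 0.0
    n = len(start)
    log_lik = 0.0
    alpha = []
    for t, call in enumerate(seq):
        if t == 0:
            # Initial step: start probability times emission probability.
            alpha = [start[j] * emit[j].get(call, 0.0) for j in range(n)]
        else:
            # Recursion: propagate through transitions, then emit the call.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j].get(call, 0.0)
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this HMM
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_lik += math.log(scale)
    return log_lik

def mixture_log_likelihood(seq, weights, hmms):
    """Log of the weighted sum of component likelihoods (log-sum-exp)."""
    logs = [math.log(w) + forward_log_likelihood(seq, *h) for w, h in zip(weights, hmms)]
    top = max(logs)
    if top == float("-inf"):
        return top
    return top + math.log(sum(math.exp(x - top) for x in logs))

# Hypothetical component 1: a strict open -> read+ -> close protocol.
protocol_hmm = (
    [1.0, 0.0, 0.0],                                      # start probabilities
    [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],  # transitions
    [{"open": 1.0}, {"read": 1.0}, {"close": 1.0}],       # emissions per state
)
# Hypothetical component 2: a single-state background model, uniform over calls.
background_hmm = (
    [1.0],
    [[1.0]],
    [{"open": 1 / 3, "read": 1 / 3, "close": 1 / 3}],
)
weights = [0.5, 0.5]
hmms = [protocol_hmm, background_hmm]

well_ordered = mixture_log_likelihood(["open", "read", "close"], weights, hmms)
misordered = mixture_log_likelihood(["open", "close", "read"], weights, hmms)
print(well_ordered > misordered)
```

In this toy setup the well-ordered sequence scores higher than the misordered one, which is the sense in which a learned model of this kind could stand in for a correctness specification: unusually low likelihood flags unusual API usage.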
Our experiments train the model on sequences of method calls generated from over 150 million lines of Android code. We evaluate the learned model by measuring its accuracy on three tasks: learning specifications from the corpus, completing code with missing API calls, and searching for code that uses APIs in a way that matches a query. Our encouraging results indicate that statistical models of API calls learned from large code corpora can have broad value in software engineering.