Over the past several years, modern mobile SoC (system-on-chip) chipsets have increasingly incorporated, on a single chip, several general-purpose processors and application-specific accelerators in order to reduce cost, power consumption, and communication overhead. Given ever-growing performance requirements and strict power constraints, the diversity of signal processing workloads poses challenges for mapping computationally intensive algorithms onto the heterogeneous architecture of mobile SoCs. Many such workloads, for example channel decoding in wireless communication modems and mobile computer vision applications, have high computational complexity and require accelerators built on parallel algorithms and architectures to meet performance requirements. Partitioning these workloads and deploying them on the appropriate components of the mobile chipset is therefore crucial to fully utilizing the mobile SoC's heterogeneous architecture.
The goal of this thesis is to study parallel algorithms and architectures for high-performance signal processing accelerators, targeting several representative application workloads in wireless communication and mobile computing systems. We explore the design space of parallel algorithms and architectures, highlighting workload partitioning and architecture-aware optimization schemes, including algorithmic optimization, data structure optimization, and memory access optimization, to improve throughput and hardware (or energy) efficiency. As case studies, we first propose a contention-free interleaver architecture for parallel turbo decoding, which enables a hardware-efficient, high-throughput, multi-standard turbo decoding ASIC (application-specific integrated circuit). Second, we propose a massively parallel LDPC (low-density parity-check) decoding algorithm and its implementation on a GPU (graphics processing unit), which achieves high-throughput, low-latency LDPC decoding for practical SDR (software-defined radio) systems. Furthermore, we take advantage of heterogeneous mobile CPUs and GPUs to accelerate representative mobile computer vision algorithms, such as image editing and local feature extraction. Based on algorithm analysis and experimental results from these case studies, we finally explore the design space and compare the performance of accelerator architectures for wireless communication and mobile vision use cases. We show that the heterogeneous architecture of mobile systems is key to efficiently accelerating parallel algorithms and thereby meeting the growing requirements of performance, efficiency, and flexibility.
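To make the data-parallel structure behind GPU LDPC decoding concrete, below is a minimal Python sketch, not code from the thesis, of one "flooding" iteration of min-sum LDPC decoding over a toy parity-check matrix. The matrix H and the channel LLR values are invented for illustration; the relevant property is that every check-node (row) update and every variable-node (column) update is independent, which is what lets a massively parallel implementation assign one GPU thread per node.

```python
# Illustrative sketch (not from the thesis): one "flooding" iteration of
# min-sum LDPC decoding. H and the channel LLRs are toy values; the point
# is that each check-node row update (and each variable-node column
# update) is independent, so on a GPU each can map to its own thread.

import math

H = [[1, 1, 0, 1, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [1, 0, 0, 0, 1, 1],
     [0, 0, 1, 1, 0, 1]]                 # 4 parity checks x 6 variable nodes
llr = [-1.2, 0.8, -0.5, 1.1, -0.9, 0.4]  # channel log-likelihood ratios

rows = range(len(H))
cols = range(len(H[0]))

def min_sum_iteration(v2c):
    """One check-node + variable-node update (all rows/columns independent)."""
    # Check-node update: sign product and minimum magnitude over the
    # *other* incoming variable-to-check messages on each row.
    c2v = {}
    for i in rows:
        idx = [j for j in cols if H[i][j]]
        for j in idx:
            others = [v2c[i, k] for k in idx if k != j]
            sign = 1.0
            for msg in others:
                sign *= math.copysign(1.0, msg)
            c2v[i, j] = sign * min(abs(msg) for msg in others)
    # Variable-node update: channel LLR plus the *other* check-to-variable
    # messages arriving at each column; posterior sums all of them.
    v2c_new = {}
    posterior = []
    for j in cols:
        idx = [i for i in rows if H[i][j]]
        for i in idx:
            v2c_new[i, j] = llr[j] + sum(c2v[k, j] for k in idx if k != i)
        posterior.append(llr[j] + sum(c2v[i, j] for i in idx))
    return v2c_new, posterior

# Initialize messages with the channel LLRs, run one iteration, decide.
v2c = {(i, j): llr[j] for i in rows for j in cols if H[i][j]}
v2c, post = min_sum_iteration(v2c)
hard = [1 if p < 0 else 0 for p in post]  # hard decision: 1 where LLR < 0
```

In an actual GPU implementation, the two dictionary-based loops would become two kernels launched per iteration, with messages stored in flat arrays laid out for coalesced memory access; the sketch only shows the message-passing arithmetic and its independence structure.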