Recent embedded systems, such as mobile platforms, have multiple processing units that can operate in parallel, such as central processing units (CPUs) and neural processing units (NPUs). Deep-learning compilers are used to generate machine code optimized for these systems from a deep neural network (DNN). However, the compilers proposed so far generate code that either executes DNN operators sequentially on a single processing unit or runs in parallel on graphics processing units (GPUs). In this study, we pr...