We propose a model for hierarchical structured data as an extension to the stochastic temporal convolutional network. The proposed combines autoregressive with variational autoencoder and downsampling achieve superior computational complexity. evaluate on two different types of sequential data: speech handwritten text. results are promising achieving state-of-the-art performance.