Cheng-Yen (Wesley) Hsieh

I am a research scientist at TikTok studying multi-modal language agents and embodied AI. I obtained my master's degree in computer vision at Carnegie Mellon University, where I was advised by Prof. Deva Ramanan. My research experience spans machine learning (ML) and computer vision (CV), including self-supervised learning, amodal object tracking, and vision-language models. More specifically, I develop algorithms that enhance perceptual capabilities under challenging conditions, such as occlusion, by leveraging minimal supervision and multimodal information.

Prior to my master's studies, I received my B.S. from National Taiwan University, where I had the pleasure of working on self-supervised representation learning with Prof. Yu-Chiang Frank Wang and on federated learning with Prof. An-Yeu (Andy) Wu.

GitHub / Google Scholar / chengyenhsieh0806@gmail.com

news

Mar 4, 2024	Joined TikTok as a research scientist in multi-modal language agents and embodied AI.
May 29, 2023	Joined Waymo as a Machine Learning Engineer Intern.
Aug 22, 2022	Joined CMU RI as a master's student in computer vision.

selected publications

TAO-Amodal: A Benchmark for Tracking Any Object Amodally

Cheng-Yen Hsieh, Kaihua Chen, Achal Dave, Tarasha Khurana, and Deva Ramanan

In submission, Nov 2023

Our solution for unraveling occlusion scenarios for any object: amodal tracking.

Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps due to the prevalence of modal annotations in most benchmarks. To address the scarcity of amodal benchmarks, we introduce TAO-Amodal, featuring 833 diverse categories in thousands of video sequences. Our dataset includes amodal and modal bounding boxes for visible and partially or fully occluded objects, including those that are partially out of the camera frame. We investigate the current lay of the land in both amodal tracking and detection by benchmarking state-of-the-art modal trackers and amodal segmentation methods. We find that existing methods, even when adapted for amodal tracking, struggle to detect and track objects under heavy occlusion. To mitigate this, we explore simple finetuning schemes that can increase the amodal tracking and detection metrics of occluded objects by 2.1% and 3.3%.

Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond

Cheng-Yen Hsieh, Chih-Jung Chang, Fu-En Yang, and Yu-Chiang Frank Wang

IEEE WACV, 2023

With this pre-training algorithm, one can easily adapt and fine-tune models for a variety of applications, including multi-label classification, object detection, and instance segmentation.

While self-supervised learning has been shown to benefit a number of vision tasks, existing techniques mainly focus on image-level manipulation, which may not generalize well to downstream tasks at patch or pixel levels. Moreover, existing SSL methods might not sufficiently describe and associate the above representations within and across image scales. In this paper, we propose a Self-Supervised Pyramid Representation Learning (SS-PRL) framework. The proposed SS-PRL is designed to derive pyramid representations at patch levels via learning proper prototypes, with additional learners to observe and relate inherent semantic information within an image. In particular, we present a cross-scale patch-level correlation learning in SS-PRL, which allows the model to aggregate and associate information learned across patch scales. We show that, with our proposed SS-PRL for model pre-training, one can easily adapt and fine-tune the models for a variety of applications including multi-label classification, object detection, and instance segmentation.

C3-SL: Circular Convolution-Based Batch-Wise Compression for Communication-Efficient Split Learning

Cheng-Yen Hsieh, Yu-Chuan Chuang, and An-Yeu (Andy) Wu

IEEE 32nd International Workshop on Machine Learning for Signal Processing (MLSP), 2022

Split learning (SL) for efficient image recognition through batch-wise compression.

Most existing studies improve the efficiency of split learning (SL) by compressing the transmitted features. However, most works focus on dimension-wise compression that transforms high-dimensional features into a low-dimensional space. In this paper, we propose circular convolution-based batch-wise compression for SL (C3-SL) to compress multiple features into one single feature. To avoid information loss while merging multiple features, we exploit the quasi-orthogonality of features in high-dimensional space with circular convolution and superposition. To the best of our knowledge, we are the first to explore the potential of batch-wise compression under the SL scenario. Based on the simulation results on CIFAR-10 and CIFAR-100, our method achieves a 16x compression ratio with negligible accuracy drops compared with the vanilla SL. Moreover, C3-SL reduces memory overhead by 1152x and computation overhead by 2.25x compared to the state-of-the-art dimension-wise compression method.
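As a rough illustration of the idea of merging a batch of features with circular convolution and superposition, the toy sketch below binds each feature to a random key, sums the bound vectors into one, and then unbinds with circular correlation. It is not the paper's implementation: the Gaussian keys, the correlation-based decoder, the sizes, and all function names are assumptions made for illustration, and the recovered features are only noisy approximations of the originals.

```python
import random

def circular_conv(a, b):
    """Circular convolution: out[n] = sum_m a[m] * b[(n - m) mod d]."""
    d = len(a)
    return [sum(a[m] * b[(n - m) % d] for m in range(d)) for n in range(d)]

def circular_corr(a, b):
    """Circular correlation: out[n] = sum_m a[m] * b[(n + m) mod d]."""
    d = len(a)
    return [sum(a[m] * b[(n + m) % d] for m in range(d)) for n in range(d)]

def compress(features, keys):
    """Bind each feature to its key, then superpose the batch into one vector."""
    bound = [circular_conv(k, x) for k, x in zip(keys, features)]
    return [sum(column) for column in zip(*bound)]

def decompress(compressed, keys):
    """Unbind with each key; the other features in the batch appear as noise."""
    return [circular_corr(k, compressed) for k in keys]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Hypothetical toy setup: random keys with variance 1/dim are nearly orthogonal.
rng = random.Random(0)
dim, batch = 512, 2
keys = [[rng.gauss(0.0, dim ** -0.5) for _ in range(dim)] for _ in range(batch)]
features = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch)]

compressed = compress(features, keys)   # one vector stands in for the whole batch
recovered = decompress(compressed, keys)

print(cosine(recovered[0], features[0]))  # clearly positive: matches its own feature
print(cosine(recovered[0], features[1]))  # near zero: unrelated to the other feature
```

In this sketch only `compressed` would need to be transmitted, so the payload shrinks by a factor equal to the batch size, at the cost of reconstruction noise that grows as more features are superposed.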

FL-HDC: Hyperdimensional Computing Design for the Application of Federated Learning

Cheng-Yen Hsieh, Yu-Chuan Chuang, and An-Yeu (Andy) Wu

IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2021

Highly efficient image recognition under the federated learning (FL) scenario.

Federated learning (FL) is a privacy-preserving learning framework, which collaboratively learns a centralized model across edge devices. Each device trains an independent model with its local dataset and only uploads model parameters to mitigate privacy concerns. However, most FL works focus on deep neural networks (DNNs), whose intensive computation hinders FL from practical realization on resource-limited edge devices. In this paper, we exploit the high energy efficiency of hyperdimensional computing (HDC) to propose a federated learning HDC (FL-HDC). In FL-HDC, we bipolarize model parameters to significantly reduce communication costs, which are a primary concern in FL. Moreover, we propose a retraining mechanism with adaptive learning rates to compensate for the accuracy degradation caused by bipolarization. Under the FL scenario, our simulation results show the effectiveness of our proposed FL-HDC across two datasets, MNIST and ISOLET. Compared with the previous work that transmits complete model parameters to the cloud, FL-HDC reduces communication costs by 23x and 9x with comparable accuracy on ISOLET and MNIST, respectively.
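To make the bipolarization step concrete, the small sketch below maps each model parameter to +1 or -1 by its sign and packs the result at one bit per element before upload. It is an illustrative assumption rather than the paper's code: the example values, the tie-breaking rule at zero, the 32-bit baseline width, and the function names are all hypothetical, and the ratio it prints is not one of the figures reported above.

```python
def bipolarize(values):
    """Map each parameter to +1 or -1 by sign (zero goes to +1, an assumption)."""
    return [1 if v >= 0 else -1 for v in values]

def pack_bits(bipolar):
    """Pack a bipolar vector into bytes, one bit per element (+1 -> 1, -1 -> 0)."""
    packed = bytearray((len(bipolar) + 7) // 8)
    for i, b in enumerate(bipolar):
        if b == 1:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)

def unpack_bits(packed, length):
    """Recover the bipolar vector from its packed form."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(length)]

# Hypothetical accumulated values of one class hypervector on an edge device.
class_hypervector = [3, -7, 0, 12, -1, -4, 9, 2, -5, 6]

bipolar = bipolarize(class_hypervector)
payload = pack_bits(bipolar)                      # what the device would upload
restored = unpack_bits(payload, len(bipolar))     # what the server would decode

full_bits = 32 * len(class_hypervector)  # assuming 32-bit parameters as the baseline
sent_bits = len(bipolar)                 # one bit per element after bipolarization
print(bipolar)
print(full_bits / sent_bits)
```

Bipolarization discards magnitude information, which is why the abstract pairs it with a retraining mechanism to recover accuracy; that retraining step is not sketched here.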