Event Registration | Workshop on Speech Processing Research in the Era of Large Models & the CCF Technical Committee on Speech Dialogue and Auditory Processing "Into Universities" Series: The Chinese University of Hong Kong, Shenzhen
The "CCF Into Universities" public welfare event, sponsored by the China Computer Federation (CCF) and organized by the CCF Technical Committee on Speech Dialogue and Auditory Processing, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), the Shenzhen Research Institute of Big Data, and the Shenzhen Key Laboratory of Cross-Modal Cognitive Computing, will be held from 9:00 to 18:00 on Friday, December 6, 2024, in Room 103, Dao Yuan Building, CUHK-Shenzhen.
The event will open with remarks from Professor Kai Yu of Shanghai Jiao Tong University, Chair of the CCF Technical Committee on Speech Dialogue and Auditory Processing, and Professor Haizhou Li, Fellow of the Academy of Engineering, Singapore and Executive Dean of the School of Data Science at CUHK-Shenzhen. Invited talks will be given by Professors Jianwu Dang, Yonghong Yan, Wenwu Wang, Chng Eng Siong, Junichi Yamagishi, Jinyu Li, Zhiyong Wu, and Zhizheng Wu. Professors Ming Li, Jun Du, and Xie Chen will give lead-off remarks on current hot topics, followed by panel discussions with the attending experts. The event will be hosted by Professors Satoshi Nakamura and Zhizheng Wu of the School of Data Science, CUHK-Shenzhen.
You are warmly invited to follow and take part in the event, to discuss cutting-edge technologies and share research results.
Event Details
■ Time
Friday, December 6, 2024, 9:00-18:00
■ Venue
Room 103, Dao Yuan Building, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen
■ Registration
Please scan the QR code below to fill in the registration form. Registration closes at 23:59 on Tuesday, December 3.
■ Agenda
■ Organizing Chairs
Haizhou Li
Fellow, Academy of Engineering, Singapore
Executive Dean, School of Data Science, CUHK-Shenzhen
Research areas: speech information processing, natural language processing, brain-inspired computing, human-computer interaction
Biography: Professor Haizhou Li is the Executive Dean of the School of Data Science and a Presidential Chair Professor at The Chinese University of Hong Kong, Shenzhen. He is also a visiting professor at the National University of Singapore and a Bremen Excellence Chair Professor at the University of Bremen, Germany. Previously, he served as a professor at Nanyang Technological University and the National University of Singapore from 2006 to 2016, a visiting professor at the University of Eastern Finland in 2009, a visiting professor at the University of New South Wales, Australia, from 2011 to 2016, and Principal Scientist and Research Director at the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore, from 2003 to 2016. Professor Li served as Editor-in-Chief of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (2015-2018), a member of the IEEE Speech and Language Processing Technical Committee (2013-2015), the IEEE Signal Processing Society Publications Board (2015-2018), the IEEE Signal Processing Society Awards Board (2021-2023), and the IEEE Signal Processing Society Conferences Board (2023-2024), and serves as Vice President of the IEEE Signal Processing Society (2024-2026). He has also been President of the International Speech Communication Association (ISCA, 2015-2017), President of the Asia-Pacific Signal and Information Processing Association (APSIPA, 2015-2016), and President of the Asian Federation of Natural Language Processing (AFNLP, 2017-2018). In addition, he served as General Chair of several major conferences, including ACL 2012, INTERSPEECH 2014, and ICASSP 2022. Professor Li is internationally renowned not only for his contributions to speech recognition and natural language processing research, but also for leading the development of well-known speech products, such as the Chinese dictation kit released by Apple Computer for the Macintosh in 1996 and the Speech-Pen-Keyboard text input solution for Asian languages released by Lernout & Hauspie in 1999. He was the architect of a series of major technology projects, including the TELEFIQS automated call center with multilingual speech recognition for Singapore Changi International Airport (2001), the voiceprint recognition engine for the Lenovo A586 smartphone (2012), and the music recognition engine for Baidu Music (2013). His "phoneme set" research solved practical problems in speech recognition for Asian spoken languages and is registered in the United States and several European countries.
Kai Yu
Chair, CCF Technical Committee on Speech Dialogue and Auditory Processing
Biography: Kai Yu is a Distinguished Professor in the Department of Computer Science and Engineering at Shanghai Jiao Tong University and Chief Scientist of AISpeech. He is a recipient of a national high-level talent program, the NSFC Excellent Young Scientists Fund, and the Shanghai "Oriental Scholar" distinguished professorship. He has long worked on intelligent speech and language processing research and its industrial translation. He served as a member of the IEEE Speech and Language Processing Technical Committee (2017-2019), an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing, a program committee chair of international conferences such as INTERSPEECH, and an area chair of international conferences such as ACL and EMNLP. He currently heads the Academic and Intellectual Property Group of the China AI Industry Development Alliance and chairs the CCF Technical Committee on Speech Dialogue and Auditory Processing.
Talk Information
*Listed in order of presentation time
Yonghong Yan
Biography: Yonghong Yan is Chief Scientist of the Institute of Acoustics, Chinese Academy of Sciences, and a distinguished research fellow of the Academy's core-talent program. He received a bachelor's degree from the Department of Electronic Engineering, Tsinghua University, in 1990, and a Ph.D. in computer science from the Oregon Graduate Institute, USA, in 1995. He has long worked on speech acoustics and has led more than ten national research projects, including NSFC major projects and projects under the 863 Program, the National Key Technologies R&D Program, and key-area special programs. He is a national-level candidate of the New Century Talents Project, a recipient of the NSFC National Science Fund for Distinguished Young Scholars, and was named a National Outstanding Scientific and Technological Worker. He has published more than 300 papers and holds more than 100 invention patents. His awards include a Second Prize of the National Science and Technology Progress Award, a CAS Outstanding Scientific and Technological Achievement Award, and two provincial/ministerial first prizes.
Talk title: Speech Acoustics in the Era of Artificial Intelligence
Abstract: Speech acoustics is a branch of acoustics devoted to the study of human speech production and hearing. This talk will briefly review the history and research topics of the field and, through a survey spanning nearly 100 years of work on the question of whether vowels or consonants matter more in human-to-human spoken communication, trace how basic research findings and the mechanisms they revealed have been applied to real-world problems, in an attempt to explore new directions for speech technology in today's era of big data, large-scale computing, and large models. The talk will conclude with a brief overview of recent progress in speech acoustics by the team at the Institute of Acoustics.
Wenwu Wang
Biography: Wenwu Wang is a Professor in Signal Processing and Machine Learning at the University of Surrey, UK. He is also an AI Fellow at the Surrey Institute for People-Centred Artificial Intelligence. His current research interests include signal processing, machine learning and perception, artificial intelligence, machine audition (listening), and statistical anomaly detection. He has (co-)authored over 300 papers in these areas. He has been recognized as a (co-)author or (co-)recipient of more than 15 accolades, including the 2022 IEEE Signal Processing Society Young Author Best Paper Award, the ICAUS 2021 Best Paper Award, the DCASE 2020 and 2023 Judge's Awards, the DCASE 2019 and 2020 Reproducible System Awards, and the LVA/ICA 2018 Best Student Paper Award. He is an Associate Editor (2020-2025) for the IEEE/ACM Transactions on Audio, Speech and Language Processing, and an Associate Editor (2024-2026) for the IEEE Transactions on Multimedia. He was a Senior Area Editor (2019-2023) and an Associate Editor (2014-2018) for the IEEE Transactions on Signal Processing. He is the elected Chair (2023-2024) of the IEEE Signal Processing Society (SPS) Machine Learning for Signal Processing Technical Committee, a Board Member (2023-2024) of the IEEE SPS Technical Directions Board, the elected Chair (2025-2027) and Vice Chair (2022-2024) of the EURASIP Technical Area Committee on Acoustic, Speech and Music Signal Processing, and an elected Member (2021-2026) of the IEEE SPS Signal Processing Theory and Methods Technical Committee. He has been on the organising committees of INTERSPEECH 2022, IEEE ICASSP 2019 & 2024, IEEE MLSP 2013 & 2024, and SSP 2009. He is Technical Program Co-Chair of IEEE MLSP 2025. He has been an invited Keynote or Plenary Speaker at more than 20 international conferences and workshops.
Talk title: Language Queried Audio Source Separation
Abstract: Language-queried audio source separation (LASS) is a paradigm that we proposed recently for separating sound sources of interest from an audio mixture using a natural language query. The development of LASS systems offers intuitive and scalable interface tools that are potentially useful for digital audio applications, such as automated audio editing, remixing, and rendering. In this talk, we will first introduce the problem setting and motivation, making a connection with conventional paradigms including speech source separation and universal audio source separation. We then present our two newly developed LASS algorithms, AudioSep and FlowSep, respectively. AudioSep is a foundational model for open-domain audio source separation driven by natural language queries. It employs a query network and a separation network to predict time-frequency masks, enabling the extraction of target sounds based on text prompts. The model was trained on large-scale multimodal datasets and evaluated extensively on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. FlowSep is a new generative model for LASS based on rectified flow matching (RFM), which models linear flow trajectories from noise to target source features within the latent space of a variational autoencoder (VAE). During inference, the RFM-generated latent features are used to reconstruct a mel-spectrogram through the pre-trained VAE decoder, which is then passed to a pre-trained vocoder to synthesize the waveform. After this, we will discuss the datasets and performance metrics we developed for evaluating the LASS systems, and the organisation of Task 8 of the DCASE 2024 international challenge, building on the AudioSep model. Finally, we conclude the talk by outlining potential future research directions in this area.
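For readers less familiar with this paradigm, the following is a minimal sketch of the text-queried masking idea described above: a query network turns the language query into a conditioning vector, and a separation network uses it to predict a time-frequency mask over the mixture spectrogram. This is not the AudioSep implementation; all module names, dimensions, and the toy tensors are illustrative assumptions.

```python
# Illustrative sketch only, NOT the AudioSep code: a minimal text-queried
# masking separator showing query network -> conditioning vector ->
# separation network -> time-frequency mask -> masked mixture.
import torch
import torch.nn as nn

class TextQueryEncoder(nn.Module):
    """Maps a tokenized text query to a single conditioning vector."""
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):                  # (B, L)
        x = self.embed(token_ids)                  # (B, L, D)
        _, h = self.rnn(x)                         # h: (1, B, D)
        return h.squeeze(0)                        # (B, D)

class MaskSeparator(nn.Module):
    """Predicts a [0, 1] time-frequency mask conditioned on the query vector."""
    def __init__(self, n_freq=257, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq + dim, 256), nn.ReLU(),
            nn.Linear(256, n_freq), nn.Sigmoid(),
        )

    def forward(self, mix_mag, query_vec):         # (B, T, F), (B, D)
        q = query_vec.unsqueeze(1).expand(-1, mix_mag.size(1), -1)
        mask = self.net(torch.cat([mix_mag, q], dim=-1))
        return mask * mix_mag                      # estimated target magnitude

# Toy usage: batch of 2 mixtures, 100 frames, 257 frequency bins.
query = TextQueryEncoder()(torch.randint(0, 1000, (2, 6)))
target_mag = MaskSeparator()(torch.rand(2, 100, 257), query)
print(target_mag.shape)                            # torch.Size([2, 100, 257])
```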
Junichi Yamagishi
Biography: Junichi Yamagishi received a Ph.D. degree from the Tokyo Institute of Technology (Tokyo Tech), Tokyo, Japan, in 2006. From 2007 to 2013, he was a research fellow at the Centre for Speech Technology Research, University of Edinburgh, U.K. He became an associate professor with the National Institute of Informatics, Japan, in 2013, where he is currently a professor. His research interests include speech processing, machine learning, signal processing, biometrics, digital media cloning, and media forensics. He was a co-organizer of the biennial ASVspoof Challenge and the biennial Voice Conversion Challenge. He also served as a member of the IEEE Speech and Language Technical Committee from 2013 to 2019, as an Associate Editor for the IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) from 2014 to 2017, as a Senior Area Editor for IEEE/ACM TASLP from 2019 to 2023, as the chairperson of ISCA SynSIG from 2017 to 2021, and as a member-at-large of the IEEE Signal Processing Society Education Board from 2019 to 2024.
He has authored more than 400 peer-reviewed papers in various international journals and conferences. Among his publications, a paper published at the 2018 IEEE International Workshop on Information Forensics and Security, titled "MesoNet: a compact facial video forgery detection network", has been cited over 1,500 times. He and his team also received the BTAS/IJCB 5-Year Highest Impact Award from the IEEE Biometrics Council at the IEEE International Joint Conference on Biometrics (IJCB 2023) for the paper titled "Multi-task Learning for Detecting and Segmenting Manipulated Facial Images and Videos", which was published at the IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS) 2019.
Talk title: Automatic Fact Verification across Languages and Modality
Abstract: Evidence-based fact-checking aims to automatically verify the veracity of input claims using evidence extracted from knowledge databases. In this talk, we present the details of our proposed automatic fact-checking model [1] and its multilingual extension to Spanish, French, Indonesian, Chinese and Japanese [2], as well as to a multimodal model that simultaneously uses structured tabular data as well as text [3]. Furthermore, a comparison between supervised trained models and LLM-based models with in-context learning [4] is presented and their advantages and disadvantages are discussed.
Jinyu Li
Biography: Jinyu Li received the B.E. and M.E. degrees in electrical engineering and information systems from the University of Science and Technology of China, Hefei, China, in 1997 and 2000, respectively, and the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008. He has been with Microsoft, Redmond, WA, USA since 2008, where he currently serves as a Partner Applied Science Manager and leads a team dedicated to designing and enhancing speech modeling algorithms and technologies, with the aim of keeping Microsoft products at cutting-edge quality within the industry. From 2000 to 2003, he was a Researcher at the Intel China Research Center and a Research Manager at iFlytek, China.
Dr. Li has been a member of the IEEE Speech and Language Processing Technical Committee since 2017. He also served as an associate editor of the IEEE/ACM Transactions on Audio, Speech and Language Processing from 2015 to 2020. He was named an Industrial Distinguished Leader by the Asia-Pacific Signal and Information Processing Association (APSIPA) in 2021 and received the APSIPA Sadaoki Furui Prize Paper Award in 2023.
Talk title: CTC guided modality matching for fast and accurate streaming speech translation
Abstract: Models for streaming speech translation (ST) can achieve high accuracy and low latency if they're developed with vast amounts of paired audio in the source language and written text in the target language. Yet, these text labels for the target language are often pseudo labels due to the prohibitive cost of manual ST data labeling. In this work, we introduce a methodology named Connectionist Temporal Classification guided modality matching (CTC-GMM) that enhances the streaming ST model by leveraging extensive machine translation (MT) text data. This technique employs CTC to compress the speech sequence into a compact embedding sequence that matches the corresponding text sequence, allowing us to utilize matched source-target language text pairs from the MT corpora to refine the streaming ST model further. Our evaluations with FLEURS and CoVoST2 show that the CTC-GMM approach can increase translation accuracy relatively by 13.9% and 6.4% respectively, while also boosting decoding speed by 59.7% on GPU.
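As a rough illustration of the compression step described in the abstract, the sketch below shows one common way CTC outputs can collapse a frame-level speech representation into a shorter, roughly text-length embedding sequence: blank frames are dropped and frames sharing the same greedy CTC label are averaged. This is not the CTC-GMM code; the function name, the greedy heuristic, and the toy dimensions are assumptions for illustration only.

```python
# Illustrative sketch only (not the CTC-GMM implementation): compress a
# frame-level speech representation using greedy CTC labels, by removing
# blank frames and averaging runs of frames with the same non-blank label.
import torch

def ctc_compress(frames: torch.Tensor, ctc_logits: torch.Tensor, blank: int = 0):
    """frames: (T, D) encoder outputs; ctc_logits: (T, V) CTC scores.
    Returns a (T', D) sequence with T' <= T, one vector per collapsed label."""
    labels = ctc_logits.argmax(dim=-1)              # (T,) greedy CTC labels
    segments, current = [], []
    prev = None
    for t, lab in enumerate(labels.tolist()):
        if lab == blank:
            prev = None                             # blank separates segments
            continue
        if lab != prev and current:                 # a new label starts a new segment
            segments.append(torch.stack(current).mean(dim=0))
            current = []
        current.append(frames[t])
        prev = lab
    if current:
        segments.append(torch.stack(current).mean(dim=0))
    return torch.stack(segments) if segments else frames[:0]

# Toy usage: 50 frames of 16-dim features, vocabulary of 10 labels (0 = blank).
compressed = ctc_compress(torch.randn(50, 16), torch.randn(50, 10))
print(compressed.shape)   # (T', 16) with T' typically much smaller than 50
```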
Zhiyong Wu
Biography: Zhiyong Wu is an associate research professor and doctoral supervisor at Tsinghua Shenzhen International Graduate School, Tsinghua University, and Deputy Secretary-General of the CCF Technical Committee on Speech Dialogue and Auditory Processing. His research interest is intelligent speech interaction. He has received science and technology progress awards from the Ministry of Education, Beijing Municipality, and Shenzhen Municipality, as well as first place in the ICASSP 2023 speech signal quality enhancement challenge, an ICASSP Top 3% paper, a Best Paper Award at the China Multimedia Conference, an INTERSPEECH Best Student Paper Award, and a CVPR Highlight paper. He has also received the Shenzhen Teaching Achievement Award, the "Best Mentor and Friend" honor at Tsinghua University, and the Distinguished Service Award of the CCF Technical Committee on Speech Dialogue and Auditory Processing.
Talk title: Fine-Grained Speech Understanding and Finely Controllable Generation Based on Large Models
Abstract: Speech is the most natural and direct way for humans to interact. It carries not only semantic content but also a wealth of information beyond the words themselves: speaker attributes such as timbre, age, and identity; prosodic properties such as pitch, speaking rate, and energy; and paralinguistic information such as emotion, attitude, and scene. All of this poses major challenges for building large models. This talk will discuss methods for fine-grained speech understanding and finely controllable speech generation based on large language models, aiming to build large expressive speech generation models that can be finely controlled through human instructions, by means of more accurate speech understanding and more precise controllable generation, and thereby to offer new solutions for natural and harmonious multimodal interaction.
Jianwu Dang
Biography: Prof. Jianwu Dang is with the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China. He graduated from Tsinghua University, China, in 1982 and received his M.S. degree from the same university in 1984. He worked at Tianjin University as a lecturer from 1984 to 1988 and received his Ph.D. degree from Shizuoka University, Japan, in 1992. He worked at ATR Human Information Processing Research Laboratories, Japan, as a senior researcher from 1992 to 2001, and joined the University of Waterloo, Canada, as a visiting scholar for one year in 1998. He was a professor at the Japan Advanced Institute of Science and Technology (JAIST) from 2001 to 2022 and is now an emeritus professor of JAIST. He joined the Institut de la Communication Parlée (ICP), Centre National de la Recherche Scientifique (CNRS), France, as a first-class research scientist from 2002 to 2003. He held a part-time appointment at Tianjin University, Tianjin, China, from 2009 to 2023, and joined the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, in 2023. His research interests span all fields of speech science and technology, including speech signal processing, disordered speech, and the cognitive functions of speech.
Talk title: Spoken language information processing based on the speech chain
Abstract: This talk will focus on how to apply physiological and neurological knowledge to spoken language processing. We first introduce work that uses the mechanisms of speech production and perception for disordered speech processing. Then, we introduce work that applies neurological knowledge to speech recognition. We will also present work that investigates the neural mechanisms of speech comprehension based on EEG.
Zhizheng Wu
Biography: Zhizheng Wu is an associate professor and doctoral supervisor at The Chinese University of Hong Kong, Shenzhen, a recipient of a national-level young talent program, and has repeatedly been listed among Stanford University's "World's Top 2% Scientists". He received his Ph.D. from Nanyang Technological University and has held research and technical leadership positions at Meta (formerly Facebook), Apple, the University of Edinburgh, and Microsoft Research Asia. He initiated the Merlin and Amphion open-source systems and the open-source dataset Emilia, which have been adopted by more than 300 organizations and have repeatedly topped the GitHub trending list. He has organized international challenges on speech deepfake detection, speech synthesis, and voice conversion, and has received multiple best paper awards. He serves on the editorial boards of journals including IEEE/ACM TASLP and IEEE SPL, and was General Chair of SLT 2024.
Talk title: Research Progress on Large Speech Models
Abstract: Speech carries rich information: not only the content, but also paralinguistic and environmental information. Paralinguistic information includes emotion, accent, age, and so on, while environmental information conveys the scene in which the speech takes place. As research deepens and technology advances, spoken language understanding systems need not only to understand the textual content but also to recognize and process the paralinguistic and environmental information in speech, so that they can show genuine empathy: hearing clearly, understanding what is said, and also picking up the "human touch". This talk will present recent progress on speech understanding for large speech-interaction models and on highly expressive speech generation models with zero-shot learning capability.
Chng Eng Siong
Biography: Dr. Chng Eng Siong is a Professor with the College of Computing and Data Science (CCDS) at Nanyang Technological University (NTU) in Singapore. Prior to joining NTU in 2003, he worked at Knowles Electronics (USA), Lernout & Hauspie (Belgium), the Institute for Infocomm Research (I2R) in Singapore, and RIKEN in Japan. He received both a PhD and a BEng (Hons) from the University of Edinburgh, U.K., in 1996 and 1991, respectively, specializing in digital signal processing. His areas of expertise include machine learning, speech research, and applications of Large Language Models.
He currently serves as the Principal Investigator (PI) of the AI-Singapore Speech Lab from 2023 to 2025. Throughout his career, he has secured research grants from various institutions, including Alibaba ANGEL Lab, NTU-Rolls Royce, Mindef, MOE, and AStar. These grants, totaling over S$18 million, were awarded under the “Speech and Language Technology Program (SLTP)” in the School of Computer Science and Engineering (SCSE) at NTU. In recognition of his expertise, he was awarded the Tan Chin Tuan fellowship in 2007 to conduct research at Tsinghua University in Fang Zheng’s lab. Additionally, he received the JSPS travel grant award in 2008 to visit Tokyo Institute of Technology in Furui’s Lab.
He has supervised more than 19 PhD students and 13 Master's students to graduation. His publication record includes 2 edited books and over 200 journal and conference papers. Additionally, he has contributed to the academic community by serving as the publication chair for 5 international conferences, including Human Agent Interaction 2016, INTERSPEECH 2014, APSIPA-2010, APSIPA-2011, and ISCSLP-2006. Furthermore, he has served on the organizing committees of ASRU 2019 (Singapore), ICAICTA 2024 (General Co-chair), and SLT 2024 (General Co-chair).
Talk title: Towards LLM-based ASR – experiences from NTU's Speech Lab
Abstract: The advent of large language models (LLMs) has revolutionized natural language processing, offering unprecedented capabilities in understanding, generating, and contextualizing text. Recent advances have extended these capabilities to other modalities, such as audio, video, and images.
Our focus in this talk is the integration of the speech modality into LLMs. For this task, the research community has proposed various innovative approaches, e.g., applying discrete representations, integrating pre-trained ASR encoders into existing LLM decoder architectures (e.g., Qwen-Audio), multitask learning, and multimodal pretraining.
In this talk, I will discuss NTU's recent approaches toward this goal, specifically in the following three areas:
(1) "Hyporadise": applying an LLM to the N-best hypotheses generated by traditional ASR models to perform generative error correction, so as to produce more accurate transcription output (see the sketch after this list);
(2) extending Hyporadise to include acoustic and textual noise information during training to improve robustness in noisy scenarios, even under low-SNR speech conditions;
(3) using LLMs for generative speech enhancement.
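As noted above, the sketch below loosely illustrates point (1): the ASR N-best hypotheses are formatted into a prompt, and an instruction-following LLM is asked to output a single corrected transcription. This is not the Hyporadise implementation; the prompt wording and the ask_llm callable are placeholders for whatever LLM interface is available.

```python
# Illustrative sketch only, NOT the Hyporadise code: the general shape of
# LLM-based generative error correction over an ASR N-best list.
# `ask_llm` is a placeholder for an arbitrary LLM interface.
from typing import Callable, List

def build_gec_prompt(nbest: List[str]) -> str:
    """Formats an N-best hypothesis list into a correction prompt."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "The following are candidate transcriptions of the same utterance, "
        "produced by a speech recognizer and ordered from most to least likely:\n"
        f"{hyps}\n"
        "Report the single most plausible transcription, fixing any recognition "
        "errors. Output only the corrected sentence."
    )

def correct_with_llm(nbest: List[str], ask_llm: Callable[[str], str]) -> str:
    """Runs generative error correction over an N-best list via an LLM callable."""
    return ask_llm(build_gec_prompt(nbest)).strip()

# Toy usage with a stand-in "LLM" that simply returns the second hypothesis.
nbest = ["i red the book yesterday", "i read the book yesterday", "i rid the book yesterday"]
print(correct_with_llm(nbest, ask_llm=lambda prompt: nbest[1]))
```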
Guest Introductions
Host
NAKAMURA, Satoshi
Professor, School of Data Science, CUHK-Shenzhen
Ph.D., Kyoto University
Research areas: artificial intelligence, spoken language processing
Biography: Satoshi Nakamura has served as a professor at the Nara Institute of Science and Technology (NAIST) and an honorary professor at the Karlsruhe Institute of Technology. He is a Fellow of IEEE, ISCA, the Information Processing Society of Japan, and the Advanced Telecommunications Research Institute International (ATR), and a member of the IEEE Signal Processing Society (https://signalprocessingsociety.org/newsletter/2021/09/member-highlights-satoshi-nakamura). He received his bachelor's degree from the Kyoto Institute of Technology in 1981 and his Ph.D. from Kyoto University in 1992, and was an associate professor at the Graduate School of Information Science, NAIST, from 1994 to 2000. He served as Department Head and then Director of the ATR Spoken Language Communication Research Laboratories during 2000-2004 and 2005-2008, respectively, and as ATR Vice President in 2007-2008. From 2009 to 2010, he was Director of the Keihanna Research Laboratories and Executive Director of the Knowledge Creating Communication Research Center at the National Institute of Information and Communications Technology, Japan. He joined NAIST as a full professor in 2011, established the Data Science Center at NAIST, and directed the center from 2017 to 2021. He is now a full professor in the Division of Information Science at NAIST. From 2017 to 2021, he was Team Leader of the Tourism Information Analytics Team at RIKEN, Japan. His research interests include modeling and systems for spoken language processing, speech processing, speech translation, spoken dialogue systems, natural language processing, and data science. He is one of the world leaders in speech translation research and has worked on various speech translation research projects, including C-STAR, A-STAR, and international spoken language translation evaluation campaigns. He currently chairs the International Speech Communication Association Special Interest Group on Spoken Language Translation and has contributed to the standardization of network-based speech translation at the International Telecommunication Union. He was a member of the IEEE SLTC from 2016 to 2018 and an elected ISCA Board member from 2012 to 2019. He received the Antonio Zampolli Prize in 2012.
Guests
Ming Li
Biography: Ming Li is a Professor of Electrical and Computer Engineering at Duke Kunshan University, a researcher at its Data Science Research Center, and an adjunct professor and doctoral supervisor at the School of Computer Science, Wuhan University. He was selected as a Category B high-level talent in the 15th batch of the Jiangsu Province "Six Talent Peaks" program. He received his Ph.D. from the Department of Electrical Engineering, University of Southern California, in 2013. He was an associate professor and doctoral supervisor at Sun Yat-sen University from 2013 to 2017 and joined Duke Kunshan University in 2018. He has been a visiting professor at Carnegie Mellon University and a visiting researcher at Duke University, a member of the IEEE Speech and Language Processing Technical Committee, an associate editor of several international journals, and a technical program committee chair of major conferences such as ASRU and Odyssey. Teams led by him have won first place in international evaluations more than ten times and received two best paper awards at international conferences. He has published more than 200 papers, with about 9,600 Google Scholar citations. He received an IBM Faculty Award in 2016, the ISCA Best Journal Paper Award in 2018, and the Ministry of Education award for outstanding scientific research achievements by young scholars in higher education institutions in 2020.
Jun Du
Biography: Jun Du is an associate professor and doctoral supervisor at the National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China. From 2009 to 2013, he worked at the iFLYTEK Research Institute and Microsoft Research Asia, where he led the development of several products in speech recognition, handwriting recognition, and OCR. His research focuses on speech signal processing and applications of pattern recognition, and his publications have received more than 10,000 Google Scholar citations. He received the 2018 IEEE Signal Processing Society Best Paper Award, a First Prize of the 2023 National Science and Technology Progress Award, a First Prize of the 2022 Wu Wenjun AI Science and Technology Progress Award, a First Prize of the 2018 Anhui Province Science and Technology Progress Award, and the Best Paper Award at ISCSLP 2022, and was listed in both the career-long and the single-year 2024 Stanford "World's Top 2% Scientists" rankings. He is a senior member of IEEE, INNS, CCF, and CSIG; serves on the editorial board of IEEE Signal Processing Letters; is a member of the IEEE Signal Processing Society Speech and Language Processing Technical Committee (SLTC) and the Audio and Acoustic Signal Processing Technical Committee (AASP-TC); and chairs the Speech, Language, and Audio Technical Committee (SLA-TC) of the Asia-Pacific Signal and Information Processing Association (APSIPA). He previously served on the editorial board of the IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Xie Chen
Biography: Xie Chen is an associate professor and doctoral supervisor in the Department of Computer Science and Engineering, Shanghai Jiao Tong University, supported by the national Overseas High-Level Talent (Young Scholar) Program. He received his Ph.D. in information engineering from the University of Cambridge. Before returning to China, he conducted postdoctoral research at the University of Cambridge and then served as a senior researcher and later principal researcher at Microsoft Research in the United States. His main research interests are deep learning, intelligent speech, and audio signal processing, and he has published more than 90 papers in leading international conferences and journals in the field.