CURATORIAL STATEMENT 策展论述

Lying Sophia &
Mocking Alexa



︎



Text HE Di 文 贺笛




In recent years, deep learning has pushed the limits of many real-world applications, including speech recognition [1], image classification [2], and machine translation [3]. Deep neural network-based models have even achieved superhuman performance in challenging game environments such as Go [4], StarCraft [5], and Dota 2 [6]. The keys to the success of deep learning span many aspects, including advanced neural network architectures [2,3], modern optimization algorithms [7], massive data, and enormous computational power [4,6].

In this project, we mainly leverage deep learning models for natural language processing. The conversations are generated in three steps: conditional sentence generation for the English version, text-to-text translation from English to Chinese, and text-to-speech synthesis. We briefly introduce the deep learning models we used below.

In the conditional sentence generation step, we use the GPT-2 model [8], the current state-of-the-art language generation model based on the Transformer architecture [3,9]. The model is trained to predict the distribution of the next word conditioned on the preceding words in a sentence, using 8 million English web documents, roughly 40 GB of plain text. Because the model can predict an appropriate next word given any context, we can use it to generate a sentence word by word, autoregressively.
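The autoregressive loop can be illustrated with a short sketch. The snippet below is a minimal illustration, assuming the Hugging Face transformers port of GPT-2 rather than the original OpenAI release; at every step the model outputs a distribution over the next token, one token is sampled from it and appended to the context, and the loop repeats.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    # Start from a conditioning context and extend it token by token.
    context = tokenizer.encode("Alexa can help human", return_tensors="pt")

    for _ in range(20):                                        # generate 20 more tokens
        with torch.no_grad():
            logits = model(context).logits                     # (1, seq_len, vocab_size)
        probs = torch.softmax(logits[0, -1, :], dim=-1)        # next-token distribution
        next_token = torch.multinomial(probs, num_samples=1)   # sample one token
        context = torch.cat([context, next_token.unsqueeze(0)], dim=1)

    print(tokenizer.decode(context[0]))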

We use the open-sourced GPT-2 medium model, which contains 330 million parameters. In particular, for Alexa and Sophia, we feed the GPT-2 model with hand-crafted sentence beginnings. For example, we create a sentence beginning "Alexa can help human" and use the GPT-2 model to generate a sentence from it automatically. Since the neural language model is a probabilistic generative model, we can sample different outputs in different rounds. For each sentence beginning for Alexa and Sophia, we randomly sample 512 sentences, following the suggested hyperparameters in [8]: we set the temperature to 1.0, the top-k number to 40 to balance accuracy and diversity, and the maximum sentence length to 128. We create 81 different sentence beginnings for Alexa and Sophia and finally obtain 80,000 sentences with 1,000,000 words in total. We randomly organize the sentences from Alexa and Sophia and form them into conversations.
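A sketch of this sampling setup is given below, again assuming the Hugging Face transformers port of GPT-2 medium; the prompt, temperature, top-k value, and maximum length follow the numbers stated above, and the call is repeated in batches until 512 samples per beginning are collected.

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
    model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

    prompt = "Alexa can help human"             # one of the 81 hand-crafted beginnings
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    outputs = model.generate(
        input_ids,
        do_sample=True,                         # stochastic sampling, not greedy decoding
        temperature=1.0,
        top_k=40,
        max_length=128,
        num_return_sequences=16,                # repeat the call (e.g. 32 x 16) for 512 samples
        pad_token_id=tokenizer.eos_token_id,
    )

    samples = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]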

Given the generated English sentences, we translate each one from English to Chinese using Google Translate. As far as we know, Google Translate uses a Transformer model trained on millions of bilingual sentence pairs for the two languages. Generally speaking, given a sentence in English, the Transformer encoder first encodes the sentence into context representations, which are usually real-valued vectors. The Transformer decoder then decodes these contexts through a stack of attention layers and generates the word sequence in Chinese. In the last step, we convert the texts into voices using text-to-speech APIs from iFLYTEK.
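The translation step can be sketched as follows, assuming the Google Cloud Translation (v2, "basic") Python client; the project could equally have used the web interface or another endpoint. The iFLYTEK text-to-speech call is not shown, since it goes through iFLYTEK's own authenticated REST API.

    from google.cloud import translate_v2 as translate

    client = translate.Client()  # requires Google Cloud credentials in the environment

    def to_chinese(sentence: str) -> str:
        # The service returns a dict whose 'translatedText' field holds the Chinese output.
        result = client.translate(sentence, source_language="en", target_language="zh-CN")
        return result["translatedText"]

    print(to_chinese("Alexa can help human beings understand the world."))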

[1]. Hinton, Geoffrey, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior et al. "Deep neural networks for acoustic modeling in speech recognition." IEEE Signal Processing Magazine 29 (2012).

[2]. He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." CVPR 2016.

[3]. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." NIPS 2017.

[4]. AlphaGo, https://deepmind.com/research/alphago/, DeepMind, 2017.

[5]. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II, https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, DeepMind, 2019.

[6]. OpenAI Five. https://openai.com/five/, OpenAI, 2019.

[7]. Du, Simon S., Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. "Gradient descent finds global minima of deep neural networks." ICML 2019.

[8]. Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. "Language models are unsupervised multitask learners." OpenAI Blog 1, no. 8 (2019).

[9]. Lu, Yiping, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. "Understanding and Improving Transformer from a Multi-Particle Dynamic System Point of View." arXiv preprint arXiv:1906.02762 (2019).