These slides focus mainly on BERT; XLNet and RoBERTa are not covered in the same depth.
Also, note that my own figures read top to bottom, while the figures borrowed from the papers read bottom to top.
If you spot any mistakes, please let me know and I will fix them.
(In particular, I am a little worried that I may have misread the English in the RoBERTa paper. Apologies for the excuse.)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding
RoBERTa: A Robustly Optimized BERT Pretraining Approach
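As a quick concrete anchor for the pretraining objective these three papers revolve around, here is a minimal sketch of BERT-style input masking (the 80/10/10 replacement rule from the BERT paper). The token list and vocabulary are toy placeholders, not the actual WordPiece pipeline:

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder for the tokenizer's mask symbol

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: select ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_TOKEN
            elif r < 0.9:
                inputs[i] = random.choice(vocab)
            # else: leave the token unchanged (the remaining 10%)
    return inputs, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(mask_tokens(tokens, vocab=tokens))
```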
Presentation slides for the cvpaper.challenge Meta Study Group.
cvpaper.challenge is an initiative that captures the current state of the computer vision field and sets out to create its trends. We work on paper summaries, idea generation, discussion, implementation, and paper submission, and share all kinds of knowledge. Goals for 2019: "submit 30+ papers to top conferences" and "conduct two or more comprehensive surveys of top conferences."
http://xpaperchallenge.org/cv/
1. The document discusses probabilistic modeling and variational inference. It introduces concepts like Bayes' rule, marginalization, and conditioning.
2. An equation for the evidence lower bound is derived, which decomposes the log likelihood of data into the Kullback-Leibler divergence between an approximate and true posterior plus an expected log likelihood term.
3. Variational autoencoders are discussed, where the approximate posterior is parameterized by a neural network and optimized to maximize the evidence lower bound. Latent variables are modeled as Gaussian distributions.
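For reference, the decomposition mentioned in point 2 can be written out explicitly. This is the standard identity, stated in the usual VAE notation with data x, latent z, generative model p_θ, and approximate posterior q_φ:

```latex
\log p_\theta(x)
  = \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)
  + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]
```

Since the KL term is non-negative, the expectation term is a lower bound on log p_θ(x) (the evidence lower bound), and maximizing it with respect to φ implicitly drives q_φ(z|x) toward the true posterior.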
Lecture slides by Keisuke Fukuda (PFN) for the University of Tokyo graduate course 「融合情報学特別講義Ⅲ」 (Special Lecture on Fusion Informatics III), October 19, 2022.
・Introduction to Preferred Networks
・Our developments to date
・Our research & platform
・Simulation ✕ AI
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image.
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image (a minimal loss sketch follows this list).
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
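To make item 2 above concrete, here is a minimal NumPy sketch of an InfoNCE-style contrastive loss over two augmented views of a batch. It is an illustrative simplification: the temperature value is an assumption, and MoCo itself additionally uses a momentum encoder and a queue of negatives.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.2):
    """Minimal InfoNCE loss: each row of z1 should match the same row of z2
    and mismatch every other row. Shapes: (batch, dim)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # L2-normalize embeddings
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # pairwise cosine similarities, scaled
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))  # cross-entropy with positives on the diagonal

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(info_nce_loss(z1, z2))
```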