Towards deep learning models resistant to adversarial attacks A Madry, A Makelov, L Schmidt, D Tsipras, A Vladu ICLR 2018, 2018 | 14349 | 2018 |
Towards deep learning models resistant to adversarial attacks A Mądry, A Makelov, L Schmidt, D Tsipras, A Vladu stat 1050 (9), 2017 | 56 | 2017 |
Towards principled evaluations of sparse autoencoders for interpretability and control A Makelov, G Lange, N Nanda Secure and Trustworthy Large Language Models Workshop, ICLR 2024, 2024 | 21 | 2024 |
Rethinking backdoor attacks A Khaddaj*, G Leclerc*, A Makelov*, K Georgiev, H Salman, A Ilyas, ... International Conference on Machine Learning, 16216-16236, 2023 | 19 | 2023 |
Is this the subspace you are looking for? An interpretability illusion for subspace activation patching A Makelov, G Lange, N Nanda ICLR 2024, 2023 | 14 | 2023 |
Expansion in lifts of graphs A Makelov | 7 | 2015 |
Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task A Makelov ICML 2024 Workshop on Mechanistic Interpretability Spotlight, 2024 | 2 | 2024 |
Backdoor or Feature? A New Perspective on Data Poisoning A Khaddaj, G Leclerc, A Makelov, K Georgiev, A Ilyas, H Salman, A Madry | 2 | |
Evaluating Sparse Autoencoders for Controlling Open-Ended Text Generation A Makelov, N Monson, J Adebayo Second NeurIPS Workshop on Attributing Model Behavior at Scale, 2024 | | 2024 |
mandala: Compositional Memoization for Simple & Powerful Scientific Data Management A Makelov Python in Science (SciPy) Conference 2024, 2024 | | 2024 |