Tutorials
Tutorial 1 : Full day
ICDAR Tutorial on Private, Collaborative Learning in Document Analysis
Organizers: Dimosthenis Karatzas, Rubèn Tito, Mohamed Ali Souibgui, Khanh Nguyen, Raouf Kerkouche, Kangsoo Jung, Marlon Tobaben, Joonas Jälkö, Vincent Poulain, Aurelie Joseph, Ernest Valveny, Josep Lladós, Antti Honkela, Catuscia Palamidessi, Mario Fritz
Contact: Dimosthenis Karatzas dimos@cvc.uab.es
Despite the fact that documents might be copyrighted and contain sensitive information, the Document Analysis and Recognition community has not yet incorporated training techniques and models that offer any privacy guarantees. This tutorial aims to introduce key aspects of collaborative learning and privacy-preserving machine learning techniques to the Document Analysis and Recognition community.
The need for this tutorial stems from two important observations. On one hand, many documents, especially in the administrative domain, contain sensitive information that needs to be protected. On the other hand, documents in real life cannot be freely exchanged. More often than not, different entities have access to distinct sets of documents that cannot be shared due to legal or copyright reasons.
Responding to the two observations above, the tutorial will focus on two sets of techniques. On one hand, we will introduce key topics on federated learning methods that enable collaborative learning over the naturally distributed setting of document data that we encounter in real-life. On the other hand, we will introduce differential privacy techniques which allow models to be trained with strong privacy guarantees.
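To make these two ingredients concrete, here is a minimal sketch, assuming a plain NumPy setting, that combines them in the simplest possible way: each client's model update is clipped to a fixed L2 norm, the clipped updates are averaged, and calibrated Gaussian noise is added to the aggregate, in the spirit of DP-FedAvg. All function and parameter names (clip_update, noise_multiplier, etc.) are illustrative assumptions and not taken from the tutorial material.

```python
# Minimal DP-FedAvg-style aggregation sketch; names and values are illustrative.
import numpy as np

def clip_update(update, clip_norm):
    """Scale a client's model update so its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def dp_federated_average(client_updates, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Average clipped client updates and add Gaussian noise to the aggregate."""
    rng = rng or np.random.default_rng()
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    average = np.mean(clipped, axis=0)
    # Gaussian mechanism: the clipped average has sensitivity clip_norm / n_clients,
    # so the noise standard deviation is noise_multiplier * clip_norm / n_clients.
    sigma = noise_multiplier * clip_norm / len(clipped)
    return average + rng.normal(0.0, sigma, size=average.shape)

# Toy usage: three clients each contribute an update for a 5-parameter model.
updates = [np.random.randn(5) for _ in range(3)]
noisy_global_update = dp_federated_average(updates, clip_norm=1.0, noise_multiplier=1.1)
```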
During the tutorial we will connect the above topics to Document Analysis and Recognition, highlighting the relevant benchmarks in this domain put forward by the European Lighthouse on Secure and Safe AI (ELSA).
Dimosthenis Karatzas is an associate professor at the Universitat Autònoma de Barcelona and associate director of the Computer Vision Centre (CVC) in Barcelona, Spain, where he leads the Vision, Language and Reading research group (http://vlr.cvc.uab.es). He has produced more than 140 publications on computer vision, reading systems and multimodal learning. He received the 2013 IAPR/ICDAR Young Investigator Award, a Google Research Award (2016) and two Amazon Machine Learning Research Awards (2019, 2022). He has set up two spin-off companies to date, TruColour Ltd, UK, in 2007 and AllRead, Spain, in 2019. Between 2018-19 he advised the Catalan government on the Catalan strategy of AI. He is a senior member of IEEE, a fellow of ELLIS and co-director of the ELLIS Unit Barcelona, past chair of IAPR TC11 (Reading Systems), and a member of the Artificial Intelligence Doctoral Academy (AIDA) Research and Industry Board. He created the Robust Reading Competition portal (https://rrc.cvc.uab.es), established as the de-facto international benchmark in document analysis and used by more than 45,000 registered researchers.
Mohamed Ali Souibgui is a postdoctoral researcher at Computer Vision Center, Barcelona, Spain. He received the Ph.D. degree in 2022 from the Universitat Autònoma de Barcelona (UAB), Spain. His research focuses on document image analysis using computer vision and machine learning tools.
Khanh Nguyen is currently a PhD student at the Computer Vision Center, Barcelona, Spain. His research focuses on machine learning methods for Vision-and-Language tasks, particularly exploring the role of context and how to incorporate it into the image interpretation pipeline.
Andrey Barsky is a postdoctoral researcher at the Computer Vision Center, Barcelona, Spain. He received his Ph.D. in 2015 from the University of Nottingham in the UK. His research focuses on computer vision and multimodal learning, as well as robustness and explainability in AI models.
Raouf Kerkouche is a Postdoctoral Fellow at the CISPA Helmholtz Center for Information Security, advised by Prof. Mario Fritz. His current research centers around trustworthy machine learning with a focus on privacy and security. Raouf obtained his Ph.D. at INRIA, supervised by Prof. Claude Castelluccia and Prof. Pierre Genevès, where he worked on Differentially Private Federated Learning for Bandwidth and Energy Constrained Environments, with an interest in medical applications. One of his differentially private compression approaches published at UAI’21 has been included in a federated learning platform developed for drug discovery (https://www.melloddy.eu). He obtained his Master's degrees from Paris-Sud University and Pierre and Marie Curie University in France.
Kangsoo Jung is a postdoctoral researcher in the COMETE team at Inria, working under the supervision of Catuscia Palamidessi. He received the Ph.D. degree in 2017 from Sogang University in South Korea. His research focuses on differential privacy, machine learning and game theory to address the privacy-utility tradeoff.
Marlon Tobaben is a PhD student at the Department of Computer Science, University of Helsinki, supervised by Prof Antti Honkela and affiliated with the Finnish Centre of Artificial Intelligence (FCAI), a flagship of research excellence appointed by the Research Council of Finland. Marlon's research focuses on differentially private deep and federated learning.
Joonas Jälkö is a postdoctoral researcher in Professor Antti Honkela's group at the Department of Computer Science, University of Helsinki. His research focuses mainly on differential privacy applied to statistical inference and differentially private synthetic data.
Vincent Poulain d’Andecy has been the head of the Yooz Research and Technologies Department since 2015. He is a graduate engineer of INSA Rennes and holds a PhD from La Rochelle University. He started his career at ITESOFT in 1994 and has more than 25 years of experience in the development of automatic document processing systems. At Yooz, he is in charge of AI development with a nine-person team, and he supervises PhD students and collaborative research projects in partnership with academia, such as La Rochelle University and the CVC-CERCA.
Ernest Valveny is an Associate Professor at the Universitat Autònoma de Barcelona and also a researcher at the Computer Vision Center. He was the director of the Computer Science Department at UAB from 2013 to 2019. He is a member of the Vision, Language and Reading research unit at CVC. His main research interests are computer vision, in particular text recognition and retrieval, document understanding and multimodal (vision and language) models. He has published more than 20 papers in international indexed journals and more than 100 papers in peer-reviewed international conferences. He has led a number of national and international research projects, as well as technology transfer contracts with companies, mainly related to document analysis and robust reading. He has served as a reviewer and program committee member for many of the most relevant international journals and conferences within the area of computer vision and pattern recognition.
Josep Lladós is an Associate Professor at the Computer Sciences Department of the Universitat Autònoma de Barcelona and a staff researcher of the Computer Vision Center, where he has also been the director since January 2009. He is the chair holder of Knowledge Transfer of the UAB Research Park and Santander Bank. He is the head of the Pattern Recognition and Document Analysis Group (2009SGR-00418). His current research fields are document analysis, structural and syntactic pattern recognition and computer vision. He has been the head of a number of computer vision R+D projects and has published more than 200 papers in national and international conferences and journals.
Antti Honkela is a Professor of Data Science (Machine Learning and AI) at the Department of Computer Science, University of Helsinki. He is the coordinating professor of Research Programme in Privacy-Preserving and Secure AI at the Finnish Center for Artificial Intelligence (FCAI), a flagship of research excellence appointed by the Research Council of Finland, and leader of the Privacy and infrastructures WP in European Lighthouse in Secure and Safe AI (ELSA), a European network of excellence in secure and safe AI. He serves in multiple advisory positions for the Finnish government in the privacy of health data. His research focuses on differentially private machine learning and statistical inference. He is an Action Editor of Transactions on Machine Learning Research and regularly serves as an area chair at leading machine learning conferences (NeurIPS, ICML, ICLR, AISTATS). He has taught the course Trustworthy Machine Learning including topics on privacy-preserving machine learning at the University of Helsinki since 2019.
Catuscia Palamidessi is Director of Research at INRIA Saclay (since 2002), where she leads the team COMETE. She has been a Full Professor at the University of Genova, Italy (1994-1997) and Penn State University, USA (1998-2002). Palamidessi's research interests include Privacy, Machine Learning, Fairness, Secure Information Flow, Formal Methods, and Concurrency. In 2019 she obtained an ERC advanced grant to conduct research on Privacy and Machine Learning. In 2022, she received the Grand Prix of the French Academy of Sciences. She has been PC chair of various conferences including LICS and ICALP, and PC member of more than 120 international conferences. She is on the Editorial board of several journals, including the IEEE Transactions on Dependable and Secure Computing, the ACM Transactions on Privacy and Security, Mathematical Structures in Computer Science, Theoretics, the Journal of Logical and Algebraic Methods in Programming, and Acta Informatica. She is serving on the Executive Committee of ACM SIGLOG, CONCUR, and CSL.
Mario Fritz is a faculty member at the CISPA Helmholtz Center for Information Security, an honorary professor at Saarland University, and a fellow of the European Laboratory for Learning and Intelligent Systems (ELLIS). Until 2018, he led a research group at the Max Planck Institute for Computer Science. Previously, he was a PostDoc at the International Computer Science Institute (ICSI) and UC Berkeley after receiving his PhD from TU Darmstadt and studying computer science at FAU Erlangen-Nuremberg. His research focuses on trustworthy artificial intelligence, especially at the intersection of information security and machine learning. He is Associate Editor of the journal "IEEE Transactions on Pattern Analysis and Machine Intelligence" (TPAMI) and has published over 100 articles in top conferences and journals. Currently, he is coordinating the Network of Excellence in AI "ELSA -- European Lighthouse on Secure and Safe AI", an ELLIS (https://ellis.eu) initiative funded by the EU that connects universities, research institutes, and industry partners across Europe (https://elsa-ai.eu).
Tutorial 2 : Half day
Retrieval Augmented Generation (RAG): Bridging Document Analysis and Recognition with Large Language Models
Organizer: Falak Shah (Infocusp Innovations, India) | Contact: falak@infocusp.com
The introduction of large language models (LLMs) has brought about a transformative technique known as Retrieval Augmented Generation (RAG) for retrieving and querying information spread across documents. This tutorial will begin with an introduction to RAG, along with use cases from industry and hands-on exercises for participants to gain an in-depth understanding of RAG pipeline components and libraries.
LLMs in general can answer generic queries about information stored within their weights. But how can they be used to accurately answer questions about information contained in any given database, be it private or public? RAG provides the solution. It does this by allowing the generative AI system to ingest information from diverse sources such as databases, tables, and news feeds. Consequently, RAG empowers the LLM to deliver more timely, contextually appropriate, and accurate responses.
RAG follows a two-step process: the first step vectorizes the information contained in the database for quick querying; the second step finds the information relevant to a given query and passes it as context to the LLM to generate the answer. A number of open-source libraries available today (e.g. LlamaIndex, LangChain) reduce the implementation of RAG approaches to just a few lines of code. This tutorial will give participants hands-on experience so that they can apply RAG approaches to their own problem statements.
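As a minimal sketch of this two-step process, the snippet below embeds a handful of document snippets with the sentence-transformers library, retrieves the best matches for a query by cosine similarity, and assembles the retrieved context into a prompt. The embedding model, the example snippets, and the prompt template are illustrative assumptions, not part of the tutorial material.

```python
# Minimal RAG sketch: Step 1 vectorizes documents, Step 2 retrieves and builds the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Step 1: vectorize the document collection once so it can be queried quickly.
documents = [
    "Invoice 4711 was issued on 2024-03-01 for 1,200 EUR.",
    "The maintenance contract renews automatically every January.",
    "Payment terms are 30 days net for all suppliers.",
]
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

# Step 2: embed the query, retrieve the most relevant snippets, and pass them as context.
query = "When does the maintenance contract renew?"
query_vector = encoder.encode(query, normalize_embeddings=True)
scores = doc_vectors @ query_vector            # cosine similarity (vectors are unit-normalized)
top_k = np.argsort(scores)[::-1][:2]           # indices of the two best matches
context = "\n".join(documents[i] for i in top_k)

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
# `prompt` would now be sent to an LLM, for instance through LlamaIndex or LangChain.
print(prompt)
```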
Today, RAG approaches are being utilized in a number of applications, from question answering to summarizing reports to UX research. We will also cover a few of these industrial applications as case studies.
Falak Shah is currently the Chief Machine Learning Officer at InFoCusp Innovations. In his 8+ years at Infocusp, he has worked on a wide array of ML/DL/LLM projects: financial time-series modeling, real-time object detection, CNNs, convex optimization, mapping EEG signals and audio, music generation using DL, neuro-symbolic systems for scene understanding and question answering, and summarizing and querying large databases using LLMs. He has a number of patents and publications and has conducted workshops and tutorials at various top-tier international conferences. He obtained his Master's degree in ICT (Gold medalist) from Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT) and a Bachelor's degree in Electronics and Communication from Nirma University. He was part of teams that won prizes in AI competitions on platforms such as Kaggle and AIcrowd.
Tutorial 3 : Half day – CANCELLED
Multi-modal Document Summarization in the era of LLMs & VLMs
Organizers: Tulika Saha, Raghav Jain, Sriparna Saha | Contact: sahatulika15@gmail.com
Document Summarization (DS) is an indispensable capability in the modern era of information abundance. For unimodal text documents, it serves as a vital filtering tool to determine relevance and extract key ideas from verbose articles, papers, or passages. Meanwhile, multi-modal document summarization (MDS) delivers additional contextualization and dimension by integrating data across visual, textual, and auditory channels.
Moreover, capitalizing on multiple modalities can solidify understanding, enhance memorability, and amplify impact—crucial considerations given modern attention spans. DS delivers immense value spanning domains and use cases ranging from news dissemination, scientific research, healthcare, government and legal settings amongst many others. Timeline summarization represents another emerging application, stitching together events, social media posts, and news documentation into an abridged chronological snapshot of an unfolding event. As summarization capabilities continue advancing within both unimodal and multi-modal contexts, spanning single or multiple documents, the applications across industries and disciplines appear extensive, with the potential to hugely augment knowledge discovery, sharing, and retention.
As information proliferation persists across modalities, DS has become instrumental for extracting meaningful signals from abundant noise. Bolstered by recent leaps in artificial intelligence through advances in LLMs and VLMs, the state of automatic MDS continues rapidly improving, evidencing aptitude for digesting extensive texts and multi-modal content into salient takeaways. Despite these advancements, current models face significant challenges that limit their effectiveness in real-world applications. These challenges include difficulties in processing long context lengths, where the ability to maintain coherence over extended narratives is often limited. Multi-lingualism presents another hurdle, as the capability to accurately summarize content across diverse languages and dialects remains an area requiring further development. As datasets expand and models evolve, the next generation of summarization systems, built on LLMs and VLMs, holds the promise of unlocking vast knowledge discovery. Therefore, advancing research in MDS is critical, as it stands to significantly influence how individuals and organizations assimilate, communicate, and utilize knowledge in this burgeoning era of artificial intelligence.
Through this tutorial, researchers and practitioners within the ICDAR community will find value in exploring and advancing techniques that leverage multiple modalities for more effective DS and analysis. Understanding MDS in the era of LLMs & VLMs can have several implications and applications to the ICDAR community:
(i) Enhanced Document Understanding in scenarios where documents contain a mix of text and visual information, such as technical papers, reports, or articles.
(ii) By summarizing documents using both textual and visual cues, the relevance of documents to user queries can be better assessed, leading to improved search results for Information Retrieval.
(iii) The ICDAR community often deals with challenges related to handwriting recognition, layout analysis, and document structure understanding. MDS can facilitate cross-modal analysis by combining insights from different data types, aiding in tasks such as document classification, segmentation, and extraction.
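As a small, purely textual illustration of the summarization building block discussed above (the multi-modal and multi-lingual extensions are the subject of the tutorial itself), the following sketch uses the Hugging Face transformers summarization pipeline; the choice of model and the example text are assumptions, not part of the tutorial material.

```python
# Minimal unimodal summarization sketch with the Hugging Face `transformers` pipeline;
# the model name is an illustrative assumption.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

document = (
    "Document summarization condenses long reports, articles, or transcripts into "
    "short overviews. Multi-modal variants additionally exploit figures, tables, and "
    "audio, but the core text-only step can already be exercised with a single call."
)

summary = summarizer(document, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```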
Dr. Tulika Saha
Raghav Jain is an incoming research associate at the National Centre for Text Mining at the University of Manchester, under Prof. Sophia Ananiadou. Previously, he worked with the NLP Lab of IIT Patna, India, under the guidance of Dr. Sriparna Saha and Prof. Pushpak Bhattacharyya. His research includes developing advanced AI-powered content moderation techniques, testing the boundaries of LLMs, and implementing AI across sectors including education and healthcare. His work has led to numerous publications in top-tier venues including EMNLP 2023, ACL 2023, ECML 2023, and ACM MM 2023.
Dr. Sriparna Saha
Tutorial 4 : Half day
Hands-On Deep Learning for Document Analysis
Tutorial material available here.
Organizer: Thomas M. Breuel (Nvidia, USA) | Contact: tmb@9x9.com
In recent years, there has been a resurgence of interest in OCR, document analysis, and linguistic tasks on document databases, driven in large part by the success of transformers for large language models and OCR.
The tutorial will cover basic deep learning and data processing techniques for document analysis and LLM training. The objective is to give participants the basic practical tools for working with very large datasets in this research area:
- survey of available model architectures and pretrained models
- large scale document collections in PDF and image formats; Common Crawl
- training transformer models in PyTorch for LLM and OCR
- LLM fine-tuning for document-analysis-related tasks
- WebDataset for training, managing, and transforming large datasets (see the sketch after this list)
- Ray for large scale distributed processing
- data augmentation and generation
- evaluation metrics and performance measurements
- Huggingface and libraries for LLMs
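To give a flavour of the data-handling items in the list above, here is a minimal sketch of streaming a sharded document-image dataset with WebDataset; the shard URL and the per-sample field names (a .jpg page image and a .json metadata record) are illustrative assumptions, not the tutorial's actual data.

```python
# Minimal WebDataset streaming sketch; the shard URL and field names are hypothetical.
import webdataset as wds

shards = "https://example.com/doc-shards/shard-{000000..000099}.tar"  # hypothetical location

dataset = (
    wds.WebDataset(shards)
    .shuffle(1000)              # shuffle samples within an in-memory buffer
    .decode("pil")              # decode .jpg/.png to PIL images, .json to Python dicts
    .to_tuple("jpg", "json")    # yield (image, metadata) pairs
)

for image, metadata in dataset:
    # image: PIL.Image of a document page; metadata: dict, e.g. transcription and layout
    print(image.size, sorted(metadata.keys()))
    break
```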
The tutorial will consist of a series of exercises in Jupyter Notebooks.
Prerequisites: basic knowledge of Python, PyTorch, and deep learning is recommended; participants are encouraged to set up a working PyTorch / Jupyter environment (laptop, remote desktop, and/or Google Colab).
Thomas Breuel is a research scientist at NVIDIA, focusing on petascale deep learning, distributed learning tools, text recognition, and the relationship between deep learning and statistics. He has over 30 years of experience in machine learning and computer vision. Breuel's career includes a position as a research scientist at Google, where he was part of the Google Brain team working on machine learning, pattern recognition, and computer vision. He served as a professor of computer science and director of the IUPR Research Lab at the University of Kaiserslautern, Germany, leading research in pattern recognition, machine learning, and image understanding. His work at the University of Kaiserslautern involved collaborations with Google, Microsoft, Smiths Detection, Deutsche Telekom, and the BMBF. Before his academic tenure, Breuel was a member of the research staff at Xerox PARC, focusing on computer vision, pattern recognition, and document layout analysis. He developed the layout analysis technology behind UbiText and initiated the GroupFire project for collaborative and personalized Internet search methods. Additionally, he was a member of the research staff at IBM Almaden Research Center, where he contributed to IBM's DCS 2000 team for the Year 2000 US Census and was part of the team that developed QBIC, an early content-based multimedia image and video database retrieval system. Breuel earned his Ph.D. in Computational Neuroscience from MIT, where he researched geometric aspects of visual object recognition.