OUSD (R&E) CRITICAL TECHNOLOGY AREA(S): Advanced Computing and Software; Human-Machine Interfaces

OBJECTIVE: The Defense Threat Reduction Agency (DTRA) seeks to develop AI/ML models and pipelines capable of identifying, extracting, and processing elements from scanned technical and scientific documents. This project aims to automate the extraction of tables, plots, photos, and other elements embedded within structured and unstructured text, ensuring high fidelity and accuracy in a production environment.

DESCRIPTION: The Defense Threat Reduction Information Analysis Center (DTRIAC) houses a wealth of technical documents and multimedia related to national nuclear projects. Much of this information is still non-digital, and the digitized documents often suffer from structural inconsistencies. This project proposes the development of AI/ML models and pipelines to automate the extraction and processing of document elements, enhancing the fidelity of digitized information and increasing the throughput of scanning processes. The solution will comply with DoD and NIST guidelines, ensuring secure and efficient operations.

Key Features:
Identification and Extraction: Automatically identify and extract elements such as tables, plots, and photos from scientific documents.
Additional Processing: Perform additional processing on extracted elements, such as transposing tables and identifying explicit and implicit axis values in plots.
Automation: Ensure every step (identification, extraction, and processing) is fully automated and system agnostic.
Deployment: Develop models and pipelines that are containerized, capable of being hardened, and ready for immediate deployment in a production environment.

ATTRIBUTES DESIRED:
Compatibility: The solution must ingest documents in common formats (e.g., PDF) and output data in standard formats (e.g., JSON, XML).
Robustness: The models must handle digitized documents with readability artifacts and be stable for long-term use.
Modularity: The solution should be modular, pipeline- and system-agnostic, and interoperable with other modern software.
Security: All computations must be performed locally with no off-premises resource access, ensuring data security and compliance with DoD and NIST standards.

POTENTIAL MARKET: The primary customer for this technology is the Defense Threat Reduction Agency (DTRA). Additionally, the broader U.S. defense sector, including various military branches and defense contractors, can benefit from this technology by automating the extraction and processing of scientific and technical document elements.

EXPECTED OUTCOMES:
Standalone Pipeline: A fully operational pipeline capable of running in an air-gapped environment without internet connectivity.
Comprehensive Data Extraction Tool: A tool that processes scientific documents, extracting and processing elements such as tables and plots, and providing the data in standard, machine-readable formats.

PHASE I: Proof of Concept

Objective: Define, develop, and determine the feasibility of AI/ML models for identifying, extracting, and processing elements from scientific documents.

Required Efforts:
1. Literature Review and Requirements Analysis:
o Conduct a comprehensive review of existing technologies and methodologies related to document element and layout extraction.
o Identify specific requirements and constraints of DTRIAC's documents.
2. Development of Initial Models:
o Element Identification and Extraction: Develop AI/ML models to identify and extract elements like tables, plots, and photos.
o Additional Processing: Implement techniques to process extracted elements (e.g., transposing tables, identifying axis values).
3. Integration and Testing:
o Integrate the initial models into a cohesive proof-of-concept pipeline.
o Test the pipeline using a representative sample set of documents.
o Evaluate the performance of the models in terms of accuracy, efficiency, and usability.
4. Technical Report:
o Prepare a detailed technical report summarizing the proof of concept.
o Include results from initial testing, highlighting the performance of the models and identifying any areas for improvement.
o Propose a detailed plan for Phase II prototype development, outlining the steps required to refine and expand the models and pipeline.

Deliverables:
Technical Report: A comprehensive report detailing the literature review, development process, testing results, and Phase II plan.
Proof-of-Concept Pipeline: An initial version of the pipeline demonstrating the feasibility of the proposed solution.

PHASE II: Prototype Development

Objective: Develop, demonstrate, and validate a fully functional prototype of the AI/ML data extraction and processing system, ready for deployment and evaluation.

Required Efforts:
1. Refinement and Optimization:
o Refine the AI/ML models based on feedback and results from Phase I.
o Optimize the models for improved accuracy, efficiency, and scalability.
2. Prototype Development:
o Develop a robust and scalable pipeline integrating refined models.
o Ensure the pipeline can operate in a standalone, air-gapped environment without internet connectivity.
o Implement server-side functionality to support user-side applications, ensuring compatibility with both traditional application-centric and on-premises cloud services environments.
3. Functional and Security Testing:
o Conduct comprehensive functional testing to ensure the pipeline accurately identifies, extracts, and processes document elements.
o Perform security testing to verify the pipeline's resilience against potential security threats and vulnerabilities.
4. User Interface and Usability Testing:
o Develop a user-friendly interface to present the extracted information to end-users.
o Conduct usability testing with potential end-users to gather feedback and make necessary improvements.
5. Documentation and Training:
o Prepare detailed documentation covering the installation, operation, and maintenance of the prototype.
o Provide training materials and sessions for end-users to ensure effective use of the system.

Deliverables:
Functional Prototype: A fully operational prototype ready for deployment and evaluation.
Testing Reports: Comprehensive reports detailing the results of functional, security, and usability testing.
Documentation: Complete user manuals, operational guides, and maintenance documentation.
Training Materials: Training resources, including guides and sessions, to support end-users in utilizing the prototype effectively.

PHASE III DUAL USE APPLICATIONS: Prototype Refinement

Objective: Refine the AI/ML data extraction and processing system developed in Phase II. Apply the prototype to DTRA-specific scenarios.

REFERENCES:
1. Paliwal, S. S., Vishwanath, D., Rahul, R., Sharma, M., & Vig, L. (2019, September). TableNet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR) (pp. 128-133). IEEE.
2. Hashmi, K. A., Liwicki, M., Stricker, D., Afzal, M. A., Afzal, M. A., & Afzal, M. Z. (2021). Current status and performance analysis of table recognition in document images with deep neural networks. IEEE Access, 9, 87663-87685.
3. Kasem, M., Abdallah, A., Berendeyev, A., Elkady, E., Mahmoud, M., Abdalla, M., ... & Taj-Eddin, I. (2022). Deep learning for table detection and structure recognition: A survey. ACM Computing Surveys.
4. Ma, W., Zhang, H., Yan, S., Yao, G., Huang, Y., Li, H., ... & Jin, L. (2021, September). Towards an efficient framework for data extraction from chart images. In International Conference on Document Analysis and Recognition (pp. 583-597). Cham: Springer International Publishing.
5. Cliche, M., Rosenberg, D., Madeka, D., & Yee, C. (2017). Scatteract: Automated extraction of data from scatter plots. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18-22, 2017, Proceedings, Part I 10 (pp. 135-150). Springer International Publishing.
6. Shahira, K. C., Joshi, P., & Arakkal, L. (2023). Data Extraction and Question Answering on Chart Images Towards Accessibility and Data Interpretation. IEEE Open Journal of the Computer Society.
7. National Institute of Standards and Technology (NIST) Special Publication 800-53, Security and Privacy Controls for Information Systems and Organizations, https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-53r5.pdf
8. National Institute of Standards and Technology. (n.d.). Secure software development framework (SSDF). Retrieved June 5, 2024, from https://csrc.nist.gov/Projects/ssdf

KEYWORDS: table extraction, chart extraction, document analysis, information extraction, automated classification review, artificial intelligence, natural language processing, machine learning
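
As an illustration only, the "Additional Processing" and "Compatibility" items named in this topic (transposing tables, identifying implicit axis values, and emitting JSON) could be sketched as follows. This is a minimal sketch under stated assumptions, not a prescribed design: the function names, the record fields ("type", "page", "data"), and the assumption of a linear plot axis are all hypothetical and do not come from the topic text. It uses only the Python standard library, so it makes no off-premises calls.

```python
import json


def transpose_table(rows):
    """Swap rows and columns of a rectangular table (list of lists)."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("table must be rectangular")
    return [list(column) for column in zip(*rows)]


def pixel_to_axis_value(pixel, tick_a, tick_b):
    """Estimate an implicit axis value from a pixel coordinate.

    tick_a and tick_b are (pixel, value) pairs for two labeled ticks.
    Assumption: the axis is linear, so values are linearly interpolated.
    """
    (p0, v0), (p1, v1) = tick_a, tick_b
    if p0 == p1:
        raise ValueError("ticks must be at different pixel positions")
    return v0 + (pixel - p0) * (v1 - v0) / (p1 - p0)


def to_json_record(element_type, page, data):
    """Serialize one extracted element to JSON (hypothetical record layout)."""
    return json.dumps({"type": element_type, "page": page, "data": data},
                      sort_keys=True)


if __name__ == "__main__":
    table = [["time", "pressure"], [0, 1], [1, 2]]
    print(to_json_record("table", 3, transpose_table(table)))
    print(pixel_to_axis_value(150, (100, 0.0), (200, 10.0)))
```

A logarithmic or otherwise non-linear axis would need a different mapping; the linear case is shown only because it is the simplest instance of recovering a value that is not printed on the plot.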