TUM Neural Knowledge Network - Presentation Outline

4-Minute Presentation Structure


🎯 Slide 1: Project Overview (30 seconds)

Title

TUM Neural Knowledge Network: Intelligent Knowledge Graph Search System

Core Positioning

  • Objective: Build a specialized knowledge search and graph system for the Technical University of Munich
  • Features: Dual-space architecture + Intelligent crawler + Semantic search + Knowledge visualization

Technology Stack Overview

  • Backend: FastAPI + Qdrant Vector Database + CLIP Model
  • Frontend: React + ECharts + WebSocket real-time communication
  • Crawler: Intelligent recursive crawling + Multi-dimensional scoring system
  • AI: Google Gemini summarization + CLIP multimodal vectorization

🏗️ Slide 2: Core Innovation - Dual-Space Architecture (60 seconds)

Architecture Design Philosophy

Space X (Mass Information Repository)

  • Stores all crawled and imported content
  • Fast retrieval pool supporting large-scale data

Space R (Curated Reference Space - "Senate")

  • Curated collection of high-value, unique knowledge
  • Automatic promotion through "Novelty Detection"
  • Novelty Threshold: Content with similarity < 0.8 is promoted automatically

Promotion Mechanism Highlights

1. Vector similarity detection
2. Automatic filtering of unique content (Novelty Threshold = 0.2, which matches the similarity < 0.8 rule if novelty is taken as 1 - similarity)
3. Formation of high-quality knowledge core layer
4. Support for manual forced promotion
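The promotion steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and it assumes novelty is judged against the most similar vector already in Space R.

```python
import math

SIMILARITY_THRESHOLD = 0.8  # from the outline: similarity < 0.8 is promoted


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_promote(candidate, space_r, force=False):
    """Decide whether a candidate vector enters Space R.

    Assumption: the candidate is compared with every vector in Space R
    and promoted when even its closest match stays below the threshold.
    """
    if force:  # manual forced promotion bypasses the novelty check
        return True
    max_sim = max((cosine_similarity(candidate, v) for v in space_r), default=0.0)
    return max_sim < SIMILARITY_THRESHOLD


space_r = [[1.0, 0.0], [0.0, 1.0]]
print(should_promote([1.0, 0.1], space_r))  # near-duplicate of the first vector
print(should_promote([1.0, 1.0], space_r))  # about 0.71 similar to both
```

An empty Space R promotes the first item by construction, which is one possible way to bootstrap the curated layer.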

Advantages

  • ✅ Layered Management: Mass data + Curated knowledge
  • ✅ Automatic Filtering: Intelligent identification of high-quality content
  • ✅ Efficiency Boost: Search prioritizes Space R, then expands to Space X
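The "Space R first, then Space X" search order could look like the sketch below. It is illustrative only: `tiered_search`, `top_k`, and `min_score` are assumed names and values, and `score` stands in for whatever relevance function the system uses.

```python
def search_space(score, items, min_score):
    """Score every item and return the hits above min_score, best first."""
    hits = [(item, score(item)) for item in items]
    hits = [(item, s) for item, s in hits if s >= min_score]
    return sorted(hits, key=lambda h: h[1], reverse=True)


def tiered_search(score, space_r, space_x, top_k=3, min_score=0.5):
    """Query Space R first; expand to Space X only if R yields too few hits."""
    results = search_space(score, space_r, min_score)[:top_k]
    if len(results) < top_k:
        seen = {item for item, _ in results}
        extra = [h for h in search_space(score, space_x, min_score) if h[0] not in seen]
        results.extend(extra[: top_k - len(results)])
    return results


# Toy relevance scores; Space X holds everything, Space R a curated subset.
scores = {"r1": 0.9, "r2": 0.4, "x1": 0.8, "x2": 0.6, "x3": 0.3}
space_r = ["r1", "r2"]
space_x = ["r1", "r2", "x1", "x2", "x3"]
print(tiered_search(scores.get, space_r, space_x))
```

Curated hits keep their position ahead of Space X hits in this sketch, even when a Space X item scores higher; whether the real system re-sorts the merged list is not stated in the outline.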

🕷️ Slide 3: Intelligent Crawler System Optimization (60 seconds)

Core Optimization Features

1. Deep Crawling Enhancement

  • Default depth: 8 layers (167% increase from 3 layers)
  • Adaptive expansion: High-quality pages can reach 10 layers
  • Path depth limit: High-quality URLs up to 12 layers

2. Link Priority Scoring System

Scoring Dimensions (Composite Score):
├─ URL Pattern Matching (+3.0 points: /article/, /course/, /research/)
├─ Link Text Content (+1.0 point: "learn", "read", "details")
├─ Context Position (+1.5 points: content area > navigation)
└─ Path Depth Optimization (2-4 layers optimal, reduced penalty)
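As a rough sketch, the composite score above can be computed as follows. The point values for URL patterns, link text, and context position come from the outline; the function name `score_link` and the 0.5 depth penalty are assumptions, since the outline only says the penalty is "reduced".

```python
from urllib.parse import urlparse

URL_PATTERNS = ("/article/", "/course/", "/research/")
TEXT_KEYWORDS = ("learn", "read", "details")
DEPTH_PENALTY = 0.5  # assumed value, not given in the outline


def score_link(url, link_text, in_content_area):
    """Composite priority score for a candidate link."""
    score = 0.0
    path = urlparse(url).path
    if any(pattern in path for pattern in URL_PATTERNS):
        score += 3.0  # URL pattern match
    if any(keyword in link_text.lower() for keyword in TEXT_KEYWORDS):
        score += 1.0  # informative link text (simple substring match)
    if in_content_area:
        score += 1.5  # links in the content area outrank navigation links
    depth = len([segment for segment in path.split("/") if segment])
    if not 2 <= depth <= 4:
        score -= DEPTH_PENALTY  # outside the optimal 2-4 layer range
    return score


print(score_link("https://www.tum.de/research/labs/ai", "Read more details", True))
print(score_link("https://www.tum.de/", "Home", False))
```

A crawler would typically push links into a priority queue keyed by this score so that the most promising URLs are fetched first.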

3. Adaptive Depth Adjustment

  • Page quality assessment (text block count, link count, title completeness)
  • Automatic depth increase for high-quality pages
  • Dynamic crawling strategy adjustment
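One way to picture the adaptive depth rule is the sketch below. The depth limits (8 by default, 10 for high-quality pages) are from the outline; the quality formula, its caps of 10 text blocks and 20 links, and the 0.7 threshold are all assumptions chosen for illustration.

```python
DEFAULT_DEPTH = 8    # default crawl depth from the outline
EXTENDED_DEPTH = 10  # depth allowed for high-quality pages


def page_quality(text_blocks, links, title):
    """Quality in [0, 1] from text block count, link count, and title presence.

    Assumption: each signal contributes equally and saturates at an
    arbitrary cap (10 text blocks, 20 links).
    """
    text_score = min(text_blocks / 10, 1.0)
    link_score = min(links / 20, 1.0)
    title_score = 1.0 if title and title.strip() else 0.0
    return (text_score + link_score + title_score) / 3


def max_depth_for(quality, threshold=0.7):
    """Allow deeper crawling below pages whose quality clears the threshold."""
    return EXTENDED_DEPTH if quality >= threshold else DEFAULT_DEPTH


quality = page_quality(text_blocks=12, links=25, title="Chair of Robotics")
print(quality, max_depth_for(quality))
```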

4. Database Cache Optimization

  • Check if URL exists before crawling
  • Skip duplicate content, save 50%+ time
  • Store link information, support incremental updates
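The cache check can be sketched with Python's built-in `sqlite3` module. The outline does not name the database, so SQLite is only a stand-in here, and the table layout and the `crawl_if_new` helper are hypothetical.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the page cache; an in-memory database by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    return conn


def crawl_if_new(conn, url, fetch):
    """Fetch and store a URL only if it is not cached yet.

    Returns True when the page was fetched, False when it was skipped.
    """
    row = conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone()
    if row is not None:
        return False  # already crawled, skip the network request
    content = fetch(url)
    conn.execute("INSERT INTO pages (url, content) VALUES (?, ?)", (url, content))
    conn.commit()
    return True


conn = open_cache()
fetched = []
fake_fetch = lambda url: fetched.append(url) or "<html></html>"  # stand-in fetcher
print(crawl_if_new(conn, "https://www.tum.de/", fake_fetch))
print(crawl_if_new(conn, "https://www.tum.de/", fake_fetch))
```

The second call returns without invoking the fetcher, which is where the time saving on repeat crawls would come from.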

Performance Improvements

  • ⚡ Crawling depth increased 167% (3 layers → 8 layers)
  • ⚡ Duplicate crawling reduced 50%+ (cache mechanism)
  • ⚡ High-quality content coverage increased 300%

🔍 Slide 4: Hybrid Search Ranking Algorithm (60 seconds)

Multi-layer Ranking Mechanism

Layer 1: Vector Similarity Search

  • Semantic vectorization using CLIP model (512 dimensions)
  • Fast retrieval with Qdrant vector database
  • Cosine similarity calculation

Layer 2: Multi-dimensional Fusion Ranking

Final Score = w_sim × Normalized Similarity + w_pr × Normalized PageRank
            = 0.7 × Semantic Similarity + 0.3 × Authority Ranking
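The fusion formula above translates directly into code. In this sketch the weights 0.7 and 0.3 come from the outline, while min-max scaling is an assumption: the outline says the inputs are normalized but not how.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(similarities, pageranks, w_sim=0.7, w_pr=0.3):
    """Final score = w_sim * normalized similarity + w_pr * normalized PageRank."""
    sims = min_max(similarities)
    ranks = min_max(pageranks)
    return [w_sim * s + w_pr * r for s, r in zip(sims, ranks)]


# Three candidate documents: raw cosine similarities and raw PageRank values.
final_scores = fuse([0.2, 0.6, 1.0], [0.03, 0.01, 0.02])
print(final_scores)
```

Normalizing both signals first matters because raw PageRank values are usually orders of magnitude smaller than cosine similarities, so the 0.3 weight would otherwise have almost no effect.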

Layer 3: User Interaction Enhancement

  • InteractionManager: Track clicks, views, navigation paths
  • Transitive Trust: User navigation behavior transfers trust
    • If users navigate from A to B, B gains trust boost
  • Collaborative Filtering: Association discovery based on user behavior
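A toy version of transitive trust might look like this. `InteractionTracker` is a hypothetical stand-in for the outline's InteractionManager, and the `base_boost` and `damping` values are invented for illustration.

```python
from collections import defaultdict


class InteractionTracker:
    """Hypothetical stand-in for the outline's InteractionManager."""

    def __init__(self, base_boost=0.1, damping=0.5):
        self.base_boost = base_boost  # assumed boost per navigation
        self.damping = damping        # assumed share of source trust passed on
        self.trust = defaultdict(float)
        self.transitions = defaultdict(int)

    def record_navigation(self, source, target):
        """A user moved from source to target, so target gains trust."""
        self.transitions[(source, target)] += 1
        self.trust[target] += self.base_boost + self.damping * self.trust[source]

    def co_visited(self, page):
        """Pages users navigated to from `page`, most frequent first."""
        counts = {t: n for (s, t), n in self.transitions.items() if s == page}
        return sorted(counts, key=counts.get, reverse=True)


tracker = InteractionTracker()
tracker.record_navigation("A", "B")
tracker.record_navigation("B", "C")  # C also inherits part of B's trust
tracker.record_navigation("A", "B")
print(dict(tracker.trust), tracker.co_visited("A"))
```

The `co_visited` lookup hints at how navigation counts could feed the collaborative associations mentioned above.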

Layer 4: Exploration Mechanism

  • 5% probability triggers exploration bonus (Bandit algorithm)
  • Randomly boost low-scoring results to avoid information bubbles
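The exploration step resembles an epsilon-greedy bandit, and a minimal sketch under that assumption follows. The 5% probability is from the outline; the bonus size and the choice to pick from the lower-scoring half are assumptions.

```python
import random


def apply_exploration(results, rng, epsilon=0.05, bonus=0.5):
    """Occasionally boost one low-scoring result.

    results: list of (doc_id, score) pairs.
    With probability epsilon, one result from the lower-scoring half
    receives a bonus and the list is re-sorted.
    """
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    if len(ranked) < 2 or rng.random() >= epsilon:
        return ranked  # the common case: ranking is left untouched
    cutoff = len(ranked) // 2
    idx = rng.randrange(cutoff, len(ranked))
    doc_id, score = ranked[idx]
    ranked[idx] = (doc_id, score + bonus)
    return sorted(ranked, key=lambda r: r[1], reverse=True)


results = [("a", 0.9), ("b", 0.5), ("c", 0.2), ("d", 0.1)]
print(apply_exploration(results, random.Random(), epsilon=0.05))
```

Passing the random generator in explicitly keeps the behavior reproducible in tests, where a seeded `random.Random` can be used.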

Special Features

1. Snippet Highlighting

  • Intelligent extraction of keyword context
  • Automatic keyword bold display
  • Multi-keyword optimized window selection
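Snippet extraction and highlighting could be approximated as below. This is a simplified sketch: the 80-character window, the `<b>` tags for bold display, and both function names are assumptions.

```python
import re


def best_window(text, keywords, width=80):
    """Pick the window of `width` characters covering the most distinct keywords."""
    starts = sorted(
        m.start()
        for k in keywords
        for m in re.finditer(re.escape(k), text, re.IGNORECASE)
    )
    if not starts:
        return text[:width]
    best_start, best_hits = 0, -1
    for start in starts:
        begin = max(0, start - width // 4)  # keep some leading context
        window = text[begin:begin + width].lower()
        hits = sum(1 for k in keywords if k.lower() in window)
        if hits > best_hits:
            best_start, best_hits = begin, hits
    return text[best_start:best_start + width]


def highlight(snippet, keywords):
    """Wrap every keyword occurrence in <b> tags, ignoring case."""
    if not keywords:
        return snippet
    ordered = sorted(keywords, key=len, reverse=True)  # prefer longer matches
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: f"<b>{m.group(0)}</b>", snippet)


text = "intro " * 30 + "machine learning course at TUM"
print(highlight(best_window(text, ["machine", "course"]), ["machine", "course"]))
```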

2. Graph View (Knowledge Graph Visualization)

  • ECharts force-directed layout
  • Center node + Related nodes + Collaborative nodes
  • Dynamic edge weights (based on similarity and user behavior)
  • Interactive exploration (click, drag, zoom)
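On the backend side, the Graph View data could be assembled as an ECharts `graph` series option and sent to the frontend as JSON. The sketch below assumes that split of responsibilities; the node sizes and the width formula are invented, and edge weights are passed in as plain numbers rather than derived from similarity and user behavior.

```python
import json


def build_graph_option(center, related, collaborative):
    """Build an ECharts option dict for a force-directed graph.

    related / collaborative: lists of (node_name, weight) with weight in [0, 1].
    """
    categories = [{"name": "center"}, {"name": "related"}, {"name": "collaborative"}]
    nodes = [{"name": center, "category": 0, "symbolSize": 40}]
    links = []
    for category, group in ((1, related), (2, collaborative)):
        for name, weight in group:
            nodes.append({"name": name, "category": category, "symbolSize": 20})
            links.append({
                "source": center,
                "target": name,
                "value": weight,
                "lineStyle": {"width": 1 + 4 * weight},  # thicker edge = stronger tie
            })
    series = {
        "type": "graph",
        "layout": "force",   # force-directed layout
        "roam": True,        # zoom and pan
        "draggable": True,   # drag nodes
        "categories": categories,
        "data": nodes,
        "links": links,
    }
    return {"series": [series]}


option = build_graph_option("TUM", [("Informatics", 0.9)], [("Robotics", 0.4)])
print(json.dumps(option, indent=2))
```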

📊 Slide 5: Wiki Batch Processing & Data Import (45 seconds)

XML Dump Processing System

Supported Formats

  • MediaWiki standard format
  • Wikipedia-specific format (auto-detected)
  • Wikidata format (auto-detected)
  • Compressed file support (.xml, .xml.bz2, .xml.gz)
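Handling the three listed file types needs only the standard library. The helper name `open_dump` is hypothetical, and choosing the decompressor by file extension is an assumption about how detection works.

```python
import bz2
import gzip


def open_dump(path):
    """Open a dump file as a binary stream, decompressing by file extension."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")  # plain .xml
```

Because all three branches return a file-like object, the XML parser downstream does not need to know whether the upload was compressed.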

Core Features

  • Automatic Wiki type detection
  • Parse page content and link relationships
  • Generate node CSV and edge CSV
  • One-click database import

Processing Optimization

  • Database cache checking (avoid duplicate imports)
  • Batch processing (supports large dump files)
  • Real-time progress feedback (WebSocket + progress bar)
  • Automatic link relationship extraction and storage
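Batch processing with progress feedback can be reduced to the pattern below. The callback names and the default batch size of 1000 are assumptions; in the real system `on_progress` would presumably push a WebSocket message rather than append to a list.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def import_in_batches(rows, insert_batch, size=1000, on_progress=None):
    """Insert rows batch by batch and report the running total after each one."""
    done = 0
    for batch in batched(rows, size):
        insert_batch(batch)
        done += len(batch)
        if on_progress is not None:
            on_progress(done)
    return done


stored, progress = [], []
total = import_in_batches(range(5), stored.append, size=2, on_progress=progress.append)
print(total, stored, progress)
```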

Upload Experience Optimization

  • Real-time upload progress bar (percentage, size, speed)
  • XMLHttpRequest progress monitoring
  • Beautiful UI design

💡 Slide 6: Technical Highlights Summary (25 seconds)

Core Advantages Summary

  1. Dual-Space Intelligent Architecture - Mass data + Curated knowledge
  2. Deep Intelligent Crawler - 8-layer depth + Adaptive expansion + Cache optimization
  3. Hybrid Ranking Algorithm - Semantic search + PageRank + User interaction
  4. Knowledge Graph Visualization - Graph View + Relationship exploration
  5. Batch Data Processing - Wiki Dump + Auto-detection + Progress feedback
  6. Real-time Interactive Experience - WebSocket + Progress bar + Responsive UI

Performance Metrics

  • 📈 Crawling depth increased 167%
  • 📈 Duplicate processing reduced 50%+
  • 📈 Search response time < 200ms
  • 📈 Supports large-scale knowledge graphs (100K+ nodes)

🎬 Suggested Presentation Flow

  1. Opening (10 seconds): Project positioning and core value
  2. Dual-Space Architecture (60 seconds): Show system architecture diagram and promotion mechanism
  3. Intelligent Crawler (60 seconds): Show crawling depth and scoring system
  4. Search Ranking (60 seconds): Show Graph View and search results
  5. Wiki Processing (45 seconds): Show XML Dump upload and progress bar
  6. Summary (25 seconds): Core advantages and technical metrics

Total Duration: Approximately 4 minutes (the segments above sum to 4 minutes 20 seconds)


📝 Key Presentation Points

Visual Highlights

  • ✅ 3D particle network background (high-tech feel)
  • ✅ Graph View knowledge graph visualization
  • ✅ Real-time progress bar animation
  • ✅ Search result highlighting display

Technical Depth

  • ✅ Innovation of dual-space architecture
  • ✅ Multi-dimensional scoring algorithm
  • ✅ Hybrid ranking mechanism
  • ✅ User behavior learning system

Practical Value

  • ✅ Improve information retrieval efficiency
  • ✅ Automatic discovery of knowledge associations
  • ✅ Support large-scale data import
  • ✅ Real-time interactive experience

🔧 Presentation Preparation Checklist

  • Prepare system architecture diagram (dual-space architecture)
  • Prepare Graph View demo screenshots
  • Prepare crawler scoring system examples
  • Prepare search ranking formula visualization
  • Prepare performance comparison data charts
  • Test Wiki Dump upload functionality
  • Prepare technology stack display diagram

📚 Additional Notes

If Extending Presentation (6-8 minutes)

  • Add specific code examples
  • Show database query performance
  • Demonstrate user interaction tracking system
  • Show crawler cache optimization effects

If Simplifying Presentation (2-3 minutes)

  • Focus on dual-space architecture (40 seconds)
  • Focus on search ranking algorithm (60 seconds)
  • Quick Graph View demonstration (40 seconds)

💬 FAQ Preparation

Q: Why use dual-space architecture?
A: Mass data requires layered management. Space X stores everything, while Space R curates high-quality content, which improves search efficiency and result quality.

Q: How does the crawler avoid over-crawling?
A: The multi-dimensional scoring system prioritizes high-quality links, adaptive depth adjustment sets crawl depth according to page quality, and the database cache avoids duplicate crawling.

Q: How does search ranking balance relevance and authority?
A: A hybrid model weights semantic similarity at 70% and PageRank at 30%, then combines the result with user interaction behavior to form a comprehensive ranking.

Q: How does Wiki Dump processing perform?
A: It supports compressed files, batch processing, and database cache checking, so it handles large dump files efficiently.


🎯 Presentation Tips

Opening Hook

Start with a compelling question: "How do we build an intelligent knowledge system that automatically organizes, searches, and visualizes massive amounts of academic information?"

Technical Depth vs. Clarity

  • Use visual diagrams for architecture
  • Show concrete examples (before/after comparisons)
  • Demonstrate live Graph View if possible
  • Highlight performance metrics with charts

Storytelling

  1. Problem: Managing and searching vast knowledge bases
  2. Solution: Dual-space architecture + intelligent algorithms
  3. Results: 167% depth improvement, 50%+ efficiency gain
  4. Impact: Scalable, intelligent knowledge network

Visual Aids Recommended

  • System architecture diagram (dual spaces)
  • Crawler depth comparison chart (3 → 8 layers)
  • Graph View screenshot/video
  • Performance metrics dashboard
  • Technology stack diagram

Generated for TUM Neural Knowledge Network Presentation (English Version)