# TUM Neural Knowledge Network - Presentation Outline
## 4-Minute Presentation Structure
---
## 🎯 Slide 1: Project Overview (30 seconds)
### Title
**TUM Neural Knowledge Network: Intelligent Knowledge Graph Search System**
### Core Positioning
- **Objective**: Build a specialized knowledge search and graph system for the Technical University of Munich
- **Features**: Dual-space architecture + Intelligent crawler + Semantic search + Knowledge visualization
### Technology Stack Overview
- **Backend**: FastAPI + Qdrant Vector Database + CLIP Model
- **Frontend**: React + ECharts + WebSocket real-time communication
- **Crawler**: Intelligent recursive crawling + Multi-dimensional scoring system
- **AI**: Google Gemini summarization + CLIP multimodal vectorization
---
## πŸ—οΈ Slide 2: Core Innovation - Dual-Space Architecture (60 seconds)
### Architecture Design Philosophy
**Space X (Mass Information Repository)**
- Stores all crawled and imported content
- Fast retrieval pool supporting large-scale data
**Space R (Curated Reference Space - "Senate")**
- Curated collection of high-value, unique knowledge
- Automatic promotion through "Novelty Detection"
- Novelty threshold: content whose maximum similarity to existing Space R entries is below 0.8 is promoted automatically
### Promotion Mechanism Highlights
```
1. Vector similarity detection
2. Automatic filtering of unique content (novelty distance threshold = 0.2, i.e. similarity < 0.8)
3. Formation of high-quality knowledge core layer
4. Support for manual forced promotion
```
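The promotion flow above can be sketched as a simple novelty check. This is a minimal illustration with plain cosine similarity over raw vectors and hypothetical function names, not the project's actual Qdrant-backed code:

```python
from math import sqrt

SIMILARITY_THRESHOLD = 0.8  # above this, a candidate is too close to existing Space R content

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def should_promote(candidate, space_r, force=False):
    """Promote a vector into Space R if it is novel enough (or manually forced)."""
    if force or not space_r:
        return True
    return max(cosine(candidate, v) for v in space_r) < SIMILARITY_THRESHOLD
```

In production the max-similarity step would be a single Qdrant nearest-neighbor query rather than a Python loop.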
### Advantages
- ✅ **Layered Management**: Mass data + Curated knowledge
- ✅ **Automatic Filtering**: Intelligent identification of high-quality content
- ✅ **Efficiency Boost**: Search prioritizes Space R, then expands to Space X
---
## 🕷️ Slide 3: Intelligent Crawler System Optimization (60 seconds)
### Core Optimization Features
**1. Deep Crawling Enhancement**
- Default depth: **8 layers** (167% increase from 3 layers)
- Adaptive expansion: High-quality pages can reach **10 layers**
- Path depth limit: High-quality URLs up to **12 layers**
**2. Link Priority Scoring System**
```
Scoring Dimensions (Composite Score):
├─ URL Pattern Matching (+3.0 points: /article/, /course/, /research/)
├─ Link Text Content (+1.0 point: "learn", "read", "details")
├─ Context Position (+1.5 points: content area > navigation)
└─ Path Depth Optimization (2-4 layers optimal, reduced penalty)
```
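A hedged sketch of how such a composite scorer might look. The bonus values mirror the slide's dimensions, while the penalty shape and the `score_link` signature are assumptions for illustration:

```python
import re

URL_PATTERN_BONUS = 3.0  # /article/, /course/, /research/ in the URL
TEXT_BONUS = 1.0         # "learn", "read", "details" in the anchor text
CONTEXT_BONUS = 1.5      # link found in the main content area, not navigation

def score_link(url, anchor_text, in_content_area, path_depth):
    score = 0.0
    if re.search(r"/(article|course|research)/", url):
        score += URL_PATTERN_BONUS
    if any(w in anchor_text.lower() for w in ("learn", "read", "details")):
        score += TEXT_BONUS
    if in_content_area:
        score += CONTEXT_BONUS
    # path depths of 2-4 are treated as optimal; other depths get a mild penalty
    penalty = 0.0 if 2 <= path_depth <= 4 else 0.5 * abs(path_depth - 3)
    return score - penalty
```

The crawler would sort its frontier by this score so high-value links are fetched first.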
**3. Adaptive Depth Adjustment**
- Page quality assessment (text block count, link count, title completeness)
- Automatic depth increase for high-quality pages
- Dynamic crawling strategy adjustment
**4. Database Cache Optimization**
- Check if URL exists before crawling
- Skip duplicate content, save 50%+ time
- Store link information, support incremental updates
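The cache check can be as simple as a keyed lookup before each fetch. This sketch uses an in-memory SQLite table and a hypothetical `seen_before` helper; the real system presumably checks its main database:

```python
import sqlite3

# in-memory stand-in for the crawler's URL table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, fetched_at TEXT)")

def seen_before(url):
    """Return True if the URL is already cached; otherwise record it for next time."""
    row = conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone()
    if row:
        return True
    conn.execute(
        "INSERT INTO pages (url, fetched_at) VALUES (?, datetime('now'))", (url,)
    )
    return False
```

Skipping any URL where `seen_before(url)` is true is what yields the 50%+ reduction in duplicate crawling.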
### Performance Improvements
- ⚡ Crawling depth increased **167%** (3 layers → 8 layers)
- ⚑ Duplicate crawling reduced **50%+** (cache mechanism)
- ⚑ High-quality content coverage increased **300%**
---
## 🔍 Slide 4: Hybrid Search Ranking Algorithm (60 seconds)
### Multi-layer Ranking Mechanism
**Layer 1: Vector Similarity Search**
- Semantic vectorization using CLIP model (512 dimensions)
- Fast retrieval with Qdrant vector database
- Cosine similarity calculation
**Layer 2: Multi-dimensional Fusion Ranking**
```
Final Score = w_sim × Normalized Similarity + w_pr × Normalized PageRank
            = 0.7 × Semantic Similarity + 0.3 × Authority Ranking
```
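The fusion formula can be computed directly once both signals are on the same scale. A minimal sketch, where min-max normalization is an assumption about how "Normalized" is defined:

```python
W_SIM, W_PR = 0.7, 0.3  # weights from the formula above

def normalize(values):
    # min-max normalization into [0, 1]; constant lists map to 1.0
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def fuse(similarities, pageranks):
    """Combine semantic similarity and PageRank into a single ranking score."""
    sims, prs = normalize(similarities), normalize(pageranks)
    return [W_SIM * s + W_PR * p for s, p in zip(sims, prs)]
```

Results are then sorted by the fused score, so a highly authoritative page can outrank a slightly more similar but obscure one.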
**Layer 3: User Interaction Enhancement**
- **InteractionManager**: Track clicks, views, navigation paths
- **Transitive Trust**: User navigation behavior transfers trust
- If users navigate from A to B, B gains trust boost
- **Collaborative Filtering**: Association discovery based on user behavior
**Layer 4: Exploration Mechanism**
- 5% probability triggers exploration bonus (Bandit algorithm)
- Randomly boost low-scoring results to avoid information bubbles
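The 5% exploration step might look like the following epsilon-style boost. The choice of which result to surface, and where it lands, are assumptions for illustration:

```python
import random

EXPLORE_PROB = 0.05  # 5% of queries trigger the exploration bonus

def apply_exploration(ranked_results, rng=random):
    """With small probability, move one low-ranked result into the top 3."""
    results = list(ranked_results)
    if len(results) > 3 and rng.random() < EXPLORE_PROB:
        # pick a result from the lower half and surface it near the top
        idx = rng.randrange(len(results) // 2, len(results))
        results.insert(2, results.pop(idx))
    return results
```

This keeps the ranking stable 95% of the time while occasionally exposing content the fused score would otherwise bury.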
### Special Features
**1. Snippet Highlighting**
- Intelligent extraction of keyword context
- Automatic keyword bold display
- Multi-keyword optimized window selection
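Keyword-window snippet selection can be sketched as follows; the window size, the greedy best-window heuristic, and bolding via Markdown `**` markers are all assumptions:

```python
import re

def make_snippet(text, keywords, window=80):
    """Pick the window covering the most keyword hits and bold each hit."""
    lower = text.lower()
    positions = [
        m.start()
        for kw in keywords
        for m in re.finditer(re.escape(kw.lower()), lower)
    ]
    if not positions:
        return text[:window]
    # choose the start whose window covers the most keyword occurrences
    best = max(positions, key=lambda p: sum(1 for q in positions if p <= q < p + window))
    start = max(0, best - 10)
    snippet = text[start:start + window]
    for kw in keywords:
        # lambda keeps the original casing of each matched keyword
        snippet = re.sub(re.escape(kw), lambda m: f"**{m.group(0)}**", snippet,
                         flags=re.IGNORECASE)
    return snippet
```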
**2. Graph View (Knowledge Graph Visualization)**
- ECharts force-directed layout
- Center node + Related nodes + Collaborative nodes
- Dynamic edge weights (based on similarity and user behavior)
- Interactive exploration (click, drag, zoom)
---
## 📊 Slide 5: Wiki Batch Processing & Data Import (45 seconds)
### XML Dump Processing System
**Supported Formats**
- MediaWiki standard format
- Wikipedia-specific format (auto-detected)
- Wikidata format (auto-detected)
- Compressed file support (.xml, .xml.bz2, .xml.gz)
**Core Features**
- Automatic Wiki type detection
- Parse page content and link relationships
- Generate node CSV and edge CSV
- One-click database import
**Processing Optimization**
- Database cache checking (avoid duplicate imports)
- Batch processing (supports large dump files)
- Real-time progress feedback (WebSocket + progress bar)
- Automatic link relationship extraction and storage
### Upload Experience Optimization
- Real-time upload progress bar (percentage, size, speed)
- XMLHttpRequest progress monitoring
- Beautiful UI design
---
## 💡 Slide 6: Technical Highlights Summary (25 seconds)
### Core Advantages Summary
1. **Dual-Space Intelligent Architecture** - Mass data + Curated knowledge
2. **Deep Intelligent Crawler** - 8-layer depth + Adaptive expansion + Cache optimization
3. **Hybrid Ranking Algorithm** - Semantic search + PageRank + User interaction
4. **Knowledge Graph Visualization** - Graph View + Relationship exploration
5. **Batch Data Processing** - Wiki Dump + Auto-detection + Progress feedback
6. **Real-time Interactive Experience** - WebSocket + Progress bar + Responsive UI
### Performance Metrics
- 📈 Crawling depth increased **167%**
- 📈 Duplicate processing reduced **50%+**
- 📈 Search response time < **200ms**
- 📈 Supports large-scale knowledge graphs (100K+ nodes)
---
## 🎬 Suggested Presentation Flow
1. **Opening** (10 seconds): Project positioning and core value
2. **Dual-Space Architecture** (60 seconds): Show system architecture diagram and promotion mechanism
3. **Intelligent Crawler** (60 seconds): Show crawling depth and scoring system
4. **Search Ranking** (60 seconds): Show Graph View and search results
5. **Wiki Processing** (45 seconds): Show XML Dump upload and progress bar
6. **Summary** (25 seconds): Core advantages and technical metrics
**Total Duration**: Approximately **4 minutes**
---
## πŸ“ Key Presentation Points
### Visual Highlights
- βœ… 3D particle network background (high-tech feel)
- βœ… Graph View knowledge graph visualization
- βœ… Real-time progress bar animation
- βœ… Search result highlighting display
### Technical Depth
- βœ… Innovation of dual-space architecture
- βœ… Multi-dimensional scoring algorithm
- βœ… Hybrid ranking mechanism
- βœ… User behavior learning system
### Practical Value
- βœ… Improve information retrieval efficiency
- βœ… Automatic discovery of knowledge associations
- βœ… Support large-scale data import
- βœ… Real-time interactive experience
---
## 🔧 Presentation Preparation Checklist
- [ ] Prepare system architecture diagram (dual-space architecture)
- [ ] Prepare Graph View demo screenshots
- [ ] Prepare crawler scoring system examples
- [ ] Prepare search ranking formula visualization
- [ ] Prepare performance comparison data charts
- [ ] Test Wiki Dump upload functionality
- [ ] Prepare technology stack display diagram
---
## 📚 Additional Notes
### If Extending Presentation (6-8 minutes)
- Add specific code examples
- Show database query performance
- Demonstrate user interaction tracking system
- Show crawler cache optimization effects
### If Simplifying Presentation (2-3 minutes)
- Focus on dual-space architecture (40 seconds)
- Focus on search ranking algorithm (60 seconds)
- Quick Graph View demonstration (40 seconds)
---
## 💬 FAQ Preparation
**Q: Why use dual-space architecture?**
A: Mass data requires layered management. Space X stores everything, Space R curates high-quality content, improving search efficiency and result quality.
**Q: How does the crawler avoid over-crawling?**
A: A multi-dimensional scoring system filters for high-quality links; adaptive depth adjustment responds to page quality; and the database cache prevents duplicate crawling.
**Q: How does search ranking balance relevance and authority?**
A: A hybrid model weights semantic similarity at 70% and PageRank at 30%, then folds in user interaction signals to produce the final ranking.
**Q: How is Wiki Dump processing performance?**
A: It supports compressed files, batch processing, and database cache checks, so even large dump files are handled efficiently.
---
## 🎯 Presentation Tips
### Opening Hook
Start with a compelling question: "How do we build an intelligent knowledge system that automatically organizes, searches, and visualizes massive amounts of academic information?"
### Technical Depth vs. Clarity
- Use visual diagrams for architecture
- Show concrete examples (before/after comparisons)
- Demonstrate live Graph View if possible
- Highlight performance metrics with charts
### Storytelling
1. **Problem**: Managing and searching vast knowledge bases
2. **Solution**: Dual-space architecture + intelligent algorithms
3. **Results**: 167% depth improvement, 50%+ efficiency gain
4. **Impact**: Scalable, intelligent knowledge network
### Visual Aids Recommended
- System architecture diagram (dual spaces)
- Crawler depth comparison chart (3 → 8 layers)
- Graph View screenshot/video
- Performance metrics dashboard
- Technology stack diagram
---
*Generated for TUM Neural Knowledge Network Presentation (English Version)*