{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a0000001",
   "metadata": {},
   "source": [
    "# 1 – Indexing: Dokumente in eine Vektordatenbank + BM25-Index überführen\n",
    "\n",
    "Dieses Notebook führt die **Indexing-Pipeline** durch:\n",
    "\n",
    "1. **Laden** – PDFs und Word-Dokumente werden eingelesen\n",
    "2. **Chunking** – Der Text wird in überlappende Abschnitte zerlegt\n",
    "3. **Embedding** – Jeder Chunk wird in einen Vektor umgewandelt → ChromaDB\n",
    "4. **BM25-Index** – Die gleichen Chunks werden für lexikalische Suche gespeichert\n",
    "\n",
    "> **Neu in dieser Version:** Neben der Vektordatenbank wird ein **zweiter, lexikalischer Index**\n",
    "> (BM25) erstellt. Beide Indizes basieren auf denselben Chunks, nutzen aber\n",
    "> grundlegend unterschiedliche Suchverfahren – das ist die Basis für *Hybrid Search*.\n",
    "\n",
    "> **Hinweis:** Dieses Notebook muss nur einmal (oder bei neuen Dokumenten) ausgeführt werden.  \n",
    "> Das RAG-Notebook (Notebook 3) liest dann beide Indizes."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0000002",
   "metadata": {},
   "source": [
    "## Imports & Konfiguration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b640ed5a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/thomasjorg/Documents/05_VSCode/05_Python/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import glob\n",
    "import shutil\n",
    "import pickle\n",
    "\n",
    "from langchain_community.document_loaders import PyMuPDFLoader, Docx2txtLoader\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "from langchain_huggingface import HuggingFaceEmbeddings\n",
    "from langchain_community.vectorstores import Chroma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b640ed5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- KONFIGURATION ---\n",
    "\n",
    "# Pfade\n",
    "PDF_SOURCE_DIR = \"./documents\"      # Ordner mit den Quell-Dokumenten (PDF + DOCX)\n",
    "DB_DIR         = \"./chroma_db\"       # Speicherort der Vektordatenbank\n",
    "BM25_DIR       = \"./bm25_index\"      # NEU: Speicherort für den BM25-Index\n",
    "MODEL_PATH     = \"./models\"          # Lokaler Cache für das Embedding-Modell\n",
    "\n",
    "# Embedding-Modell\n",
    "EMBEDDING_MODEL = \"intfloat/multilingual-e5-large-instruct\"\n",
    "\n",
    "# Chunking-Parameter\n",
    "CHUNK_SIZE    = 512\n",
    "CHUNK_OVERLAP = 128   # ~25 % Überlappung – wichtig bei langen deutschen Sätzen"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0000003",
   "metadata": {},
   "source": [
    "## Embedding-Modell vorbereiten\n",
    "\n",
    "Das Modell `multilingual-e5-large-instruct` erwartet **spezifische Präfixe**:  \n",
    "- Dokumente (beim Indexieren): `\"passage: ...\"`  \n",
    "- Suchanfragen (beim Retrieval): `\"query: ...\"`  \n",
    "\n",
    "Ohne diese Präfixe arbeitet das Modell deutlich unter seinem Niveau.  \n",
    "Da LangChain diese Präfixe nicht automatisch setzt, bauen wir einen dünnen Wrapper:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b640ed5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "class E5Embeddings(HuggingFaceEmbeddings):\n",
    "    \"\"\"Wrapper, der die vom E5-Modell erwarteten Präfixe automatisch setzt.\"\"\"\n",
    "\n",
    "    def embed_documents(self, texts: list[str]) -> list[list[float]]:\n",
    "        return super().embed_documents([\"passage: \" + t for t in texts])\n",
    "\n",
    "    def embed_query(self, text: str) -> list[float]:\n",
    "        return super().embed_query(\"query: \" + text)\n",
    "\n",
    "\n",
    "embeddings = E5Embeddings(\n",
    "    model_name=EMBEDDING_MODEL,\n",
    "    cache_folder=MODEL_PATH,\n",
    ")"
   ]
  },
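  {
   "cell_type": "markdown",
   "id": "c0000001",
   "metadata": {},
   "source": [
    "A quick way to convince yourself that the wrapper behaves as intended is to test the prefixing logic in isolation. The cell below is a sketch only: `FakeBase` is a hypothetical stand-in for `HuggingFaceEmbeddings` that returns its inputs unchanged, so the real model never has to be loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0000002",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: verify the prefixing logic without downloading the model.\n",
    "# FakeBase is a hypothetical stand-in that simply echoes its inputs.\n",
    "class FakeBase:\n",
    "    def embed_documents(self, texts):\n",
    "        return texts\n",
    "\n",
    "    def embed_query(self, text):\n",
    "        return text\n",
    "\n",
    "\n",
    "class PrefixedEmbeddings(FakeBase):\n",
    "    \"\"\"Same subclassing pattern as E5Embeddings above.\"\"\"\n",
    "\n",
    "    def embed_documents(self, texts):\n",
    "        return super().embed_documents([\"passage: \" + t for t in texts])\n",
    "\n",
    "    def embed_query(self, text):\n",
    "        return super().embed_query(\"query: \" + text)\n",
    "\n",
    "\n",
    "demo = PrefixedEmbeddings()\n",
    "print(demo.embed_documents([\"Newton's second law\"])[0])  # passage: Newton's second law\n",
    "print(demo.embed_query(\"What does F = m * a state?\"))    # query: What does F = m * a state?"
   ]
  },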
  {
   "cell_type": "markdown",
   "id": "a0000004",
   "metadata": {},
   "source": [
    "## Schritt 1 – Dokumente laden (PDF & DOCX)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b640ed5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mapping: Dateiendung → passender LangChain-Loader\n",
    "LOADERS = {\n",
    "    \".pdf\":  PyMuPDFLoader,\n",
    "    \".docx\": Docx2txtLoader,\n",
    "}\n",
    "\n",
    "\n",
    "def load_documents(source_dir: str):\n",
    "    \"\"\"Lädt alle unterstützten Dokumente (PDF, DOCX) aus einem Verzeichnis.\"\"\"\n",
    "    all_docs = []\n",
    "    file_count = 0\n",
    "\n",
    "    for ext, loader_cls in LOADERS.items():\n",
    "        for filepath in glob.glob(os.path.join(source_dir, f\"*{ext}\")):\n",
    "            loader = loader_cls(filepath)\n",
    "            all_docs.extend(loader.load())\n",
    "            print(f\"  📄 {os.path.basename(filepath)}\")\n",
    "            file_count += 1\n",
    "\n",
    "    print(f\"\\n✅ {len(all_docs)} Seiten/Abschnitte aus {file_count} Dokumenten geladen.\")\n",
    "    return all_docs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0000005",
   "metadata": {},
   "source": [
    "## Schritt 2 – Text in Chunks aufteilen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b640ed5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_documents(docs, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP):\n",
    "    \"\"\"Zerlegt Dokumente in überlappende Text-Chunks.\"\"\"\n",
    "    splitter = RecursiveCharacterTextSplitter(\n",
    "        chunk_size=chunk_size,\n",
    "        chunk_overlap=chunk_overlap,\n",
    "    )\n",
    "    chunks = splitter.split_documents(docs)\n",
    "    print(f\"✅ {len(chunks)} Chunks erzeugt (Größe: {chunk_size}, Overlap: {chunk_overlap})\")\n",
    "    return chunks"
   ]
  },
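  {
   "cell_type": "markdown",
   "id": "c0000003",
   "metadata": {},
   "source": [
    "To make the overlap concrete: consecutive chunks share their boundary text, so a sentence cut off at the end of one chunk reappears at the start of the next. The cell below illustrates only the window arithmetic with a naive character slicer – `RecursiveCharacterTextSplitter` is smarter (it prefers paragraph and sentence boundaries), but the size/overlap relationship is the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0000004",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: a naive fixed-size window, not LangChain's actual algorithm.\n",
    "def naive_chunks(text, size, overlap):\n",
    "    step = size - overlap  # each new window starts `overlap` characters early\n",
    "    return [text[i:i + size] for i in range(0, len(text), step)]\n",
    "\n",
    "print(naive_chunks(\"abcdefghijklmnopqrst\", size=8, overlap=4))\n",
    "# ['abcdefgh', 'efghijkl', 'ijklmnop', 'mnopqrst', 'qrst']"
   ]
  },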
  {
   "cell_type": "markdown",
   "id": "a0000006",
   "metadata": {},
   "source": [
    "## Schritt 3 – Embeddings erstellen & in ChromaDB speichern"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b640ed5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_vectorstore(chunks, embedding_fn, db_dir=DB_DIR):\n",
    "    \"\"\"Erzeugt Embeddings und speichert sie in einer lokalen ChromaDB.\"\"\"\n",
    "    # Alten Index löschen, damit keine veralteten Vektoren drin bleiben\n",
    "    if os.path.exists(db_dir):\n",
    "        shutil.rmtree(db_dir)\n",
    "        print(\"🗑️  Alter Vektor-Index gelöscht.\")\n",
    "\n",
    "    print(\"🧠 Erstelle Embeddings und speichere in ChromaDB...\")\n",
    "    vectorstore = Chroma.from_documents(\n",
    "        documents=chunks,\n",
    "        embedding=embedding_fn,\n",
    "        persist_directory=db_dir,\n",
    "    )\n",
    "    print(f\"✨ Vektor-Index gespeichert in '{db_dir}'\")\n",
    "    return vectorstore"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0000008",
   "metadata": {},
   "source": [
    "## Schritt 4 – BM25-Index erstellen & persistieren\n",
    "\n",
    "BM25 ist ein **lexikalisches** Suchverfahren – es arbeitet mit Wort-Häufigkeiten\n",
    "statt mit Vektoren. Damit ergänzt es die semantische Suche ideal:\n",
    "\n",
    "| | Vektorsuche (E5) | Lexikalische Suche (BM25) |\n",
    "|---|---|---|\n",
    "| **Stärke** | Paraphrasen, sinngemäße Fragen | Exakte Begriffe, Fachtermini, Abkürzungen |\n",
    "| **Schwäche** | Exakte Schlüsselwörter gehen unter | Versteht keine Synonyme |\n",
    "\n",
    "Wir speichern die Chunks als Pickle-Datei, damit der BM25-Retriever\n",
    "im RAG-Notebook sofort geladen werden kann, ohne die Dokumente erneut zu parsen."
   ]
  },
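  {
   "cell_type": "markdown",
   "id": "c0000005",
   "metadata": {},
   "source": [
    "To make \"word frequencies instead of vectors\" concrete, the next cell scores two toy documents with a minimal BM25 implementation. This is an illustration only – the actual index is later built by LangChain's BM25 retriever (backed by `rank_bm25`), which also takes care of tokenization and parameter tuning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0000006",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Minimal BM25 sketch with the common default parameters k1=1.5, b=0.75.\n",
    "K1, B = 1.5, 0.75\n",
    "\n",
    "def bm25_scores(query, docs):\n",
    "    tokenized = [d.lower().split() for d in docs]\n",
    "    avgdl = sum(len(t) for t in tokenized) / len(tokenized)\n",
    "    n = len(docs)\n",
    "    scores = []\n",
    "    for tokens in tokenized:\n",
    "        score = 0.0\n",
    "        for term in query.lower().split():\n",
    "            tf = tokens.count(term)                 # term frequency in this doc\n",
    "            df = sum(term in t for t in tokenized)  # number of docs containing the term\n",
    "            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)\n",
    "            norm = 1 - B + B * len(tokens) / avgdl  # document-length normalisation\n",
    "            score += idf * tf * (K1 + 1) / (tf + K1 * norm)\n",
    "        scores.append(score)\n",
    "    return scores\n",
    "\n",
    "docs = [\"kinetic energy of a mass\", \"electric field in a capacitor\"]\n",
    "print(bm25_scores(\"kinetic energy\", docs))  # first doc scores higher, second scores 0.0"
   ]
  },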
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b640ed61",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_bm25_index(chunks, bm25_dir=BM25_DIR):\n",
    "    \"\"\"Speichert die Chunks für den BM25-Retriever als Pickle-Datei.\n",
    "    \n",
    "    Der BM25-Retriever von LangChain baut seinen Index zur Laufzeit\n",
    "    aus LangChain-Documents auf. Wir müssen also nur die Documents\n",
    "    persistieren – den eigentlichen BM25-Index berechnet der Retriever\n",
    "    beim Laden automatisch (das dauert nur Sekundenbruchteile).\n",
    "    \"\"\"\n",
    "    os.makedirs(bm25_dir, exist_ok=True)\n",
    "    \n",
    "    index_path = os.path.join(bm25_dir, \"chunks.pkl\")\n",
    "    \n",
    "    # Alten Index löschen\n",
    "    if os.path.exists(index_path):\n",
    "        os.remove(index_path)\n",
    "        print(\"🗑️  Alter BM25-Index gelöscht.\")\n",
    "    \n",
    "    with open(index_path, \"wb\") as f:\n",
    "        pickle.dump(chunks, f)\n",
    "    \n",
    "    print(f\"✨ BM25-Index gespeichert in '{index_path}' ({len(chunks)} Chunks)\")\n",
    "    return index_path"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0000007",
   "metadata": {},
   "source": [
    "## Pipeline ausführen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b640ed60",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🚀 Starte Indexing-Pipeline...\n",
      "\n",
      "  📄 Physik_2016_Loesungsvorschlag.pdf\n",
      "  📄 Physik_2020-Loesungsvorschlag.pdf\n",
      "  📄 Physik_2019-Loesungshinweise.pdf\n",
      "  📄 Physik_2009.pdf\n",
      "  📄 Physik_2014_Loesungsvorschlag.pdf\n",
      "  📄 Physik_2008.pdf\n",
      "  📄 Physik_2018-Loesungshinweise.pdf\n",
      "  📄 Physik_2012_Loesungsvorschlag.pdf\n",
      "  📄 Physik_2010_Loesungsvorschlag.pdf\n",
      "  📄 2025_Phy_eAN_Aufgaben_V.pdf\n",
      "  📄 Physik_2009_Loesungsvorschlag.pdf\n",
      "  📄 2023_Phy_eAN_Aufgaben_2023_09_21.pdf\n",
      "  📄 Physik 2021 - Aufgaben.pdf\n",
      "  📄 Physik 2023 - Lösungshinweise.pdf\n",
      "  📄 Physik 2020 - Aufgaben.pdf\n",
      "  📄 2024_Phy_eAN_Aufgaben_V_.pdf\n",
      "  📄 Physik_2015_Loesungsvorschlag.pdf\n",
      "  📄 Physik_2012.pdf\n",
      "  📄 Physik_2006.pdf\n",
      "  📄 Physik 2022 - Aufgaben.pdf\n",
      "  📄 Physik_2013_Loesungsvorschlag.pdf\n",
      "  📄 Physik 2020 - Lösungsvorschlag.pdf\n",
      "  📄 Physik_2007.pdf\n",
      "  📄 Physik_2013.pdf\n",
      "  📄 Physik 2025 - Aufgaben.pdf\n",
      "  📄 Physik_2005.pdf\n",
      "  📄 Physik_2011.pdf\n",
      "  📄 Physik_2018-Aufgaben.pdf\n",
      "  📄 Physik_2010.pdf\n",
      "  📄 Physik_2004.pdf\n",
      "  📄 Physik_2014.pdf\n",
      "  📄 Physik 2024 - Aufgaben.pdf\n",
      "  📄 Physik_2015.pdf\n",
      "  📄 Physik_2017.pdf\n",
      "  📄 Physik 2023 - Aufgaben.pdf\n",
      "  📄 Physik_2019-Aufgaben.pdf\n",
      "  📄 Physik_2016.pdf\n",
      "\n",
      "✅ 494 Seiten/Abschnitte aus 37 Dokumenten geladen.\n",
      "✅ 1781 Chunks erzeugt (Größe: 512, Overlap: 128)\n",
      "\n",
      "--- Index 1: Vektordatenbank (ChromaDB) ---\n",
      "🧠 Erstelle Embeddings und speichere in ChromaDB...\n",
      "✨ Vektor-Index gespeichert in './chroma_db'\n",
      "\n",
      "--- Index 2: Lexikalischer Index (BM25) ---\n",
      "✨ BM25-Index gespeichert in './bm25_index/chunks.pkl' (1781 Chunks)\n",
      "\n",
      "🎯 Beide Indizes erstellt! Nächster Schritt: Notebook 3 (RAG) öffnen.\n",
      "   📁 Vektor-Index:  ./chroma_db\n",
      "   📁 BM25-Index:    ./bm25_index\n"
     ]
    }
   ],
   "source": [
    "# --- PIPELINE ---\n",
    "print(\"🚀 Starte Indexing-Pipeline...\\n\")\n",
    "\n",
    "docs   = load_documents(PDF_SOURCE_DIR)\n",
    "chunks = split_documents(docs)\n",
    "\n",
    "# Index 1: Semantisch (Vektordatenbank)\n",
    "print(\"\\n--- Index 1: Vektordatenbank (ChromaDB) ---\")\n",
    "db = create_vectorstore(chunks, embeddings)\n",
    "\n",
    "# Index 2: Lexikalisch (BM25)\n",
    "print(\"\\n--- Index 2: Lexikalischer Index (BM25) ---\")\n",
    "bm25_path = create_bm25_index(chunks)\n",
    "\n",
    "print(f\"\\n🎯 Beide Indizes erstellt! Nächster Schritt: Notebook 3 (RAG) öffnen.\")\n",
    "print(f\"   📁 Vektor-Index:  {DB_DIR}\")\n",
    "print(f\"   📁 BM25-Index:    {BM25_DIR}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "79d7ba08",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PyTorch: 2.8.0\n",
      "MKL verfügbar: False\n",
      "Threads: 8\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "print(f\"PyTorch: {torch.__version__}\")\n",
    "print(f\"MKL verfügbar: {torch.backends.mkl.is_available()}\")\n",
    "print(f\"Threads: {torch.get_num_threads()}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv (3.12.10)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
