AI Untwined
Operations
There are several fundamental types of AI/ML operations, including:
- Embedding – Converts input data into vector representations.
- Inference – Applies a trained model to make predictions or classifications.
- Training – Adjusts model parameters based on data to learn patterns.
- Fine-tuning – Adapts a pre-trained model to a specific task.
- Preprocessing – Cleans, normalizes, or structures raw data before use.
- Postprocessing – Refines model outputs for final consumption.
- Retrieval – Searches and fetches relevant information (e.g., RAG).
- Generation – Produces new content (e.g., text, images, code).
- Optimization – Refines a system or model for efficiency or accuracy.
- Distillation – Transfers knowledge from a large model (teacher) to a smaller one (student) for efficiency.
- Quantization – Reduces model precision (e.g., FP32 → INT8) for faster execution and a smaller memory footprint (see the toy sketch below).
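To make the FP32 → INT8 idea concrete, here is a toy Go sketch of symmetric per-vector quantization. Real inference engines use calibrated, often per-channel schemes; the names here are purely illustrative:

```go
// quantize.go – a toy illustration of symmetric INT8 quantization:
// each float32 is mapped to an int8 via one per-vector scale,
// trading precision for a ~4x smaller footprint and faster integer math.
package main

import (
	"fmt"
	"math"
)

// quantizeInt8 returns the quantized values and the scale needed to
// approximately recover the originals (x ≈ float32(q[i]) * scale).
func quantizeInt8(xs []float32) ([]int8, float32) {
	var maxAbs float32
	for _, x := range xs {
		if a := float32(math.Abs(float64(x))); a > maxAbs {
			maxAbs = a
		}
	}
	scale := maxAbs / 127 // map the largest magnitude onto the int8 range
	if scale == 0 {
		scale = 1 // all-zero input; avoid division by zero
	}
	q := make([]int8, len(xs))
	for i, x := range xs {
		q[i] = int8(math.Round(float64(x / scale)))
	}
	return q, scale
}

func main() {
	q, scale := quantizeInt8([]float32{0.12, -0.5, 0.98})
	fmt.Println(q, scale) // dequantize any entry as float32(q[i]) * scale
}
```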
Model Variants & Flavors
AI models differ based on architecture, use case, and efficiency:
- Transformers – Large, general-purpose (e.g., GPT, LLaMA, Mistral).
- Sentence Transformers – Optimized for embeddings (e.g., SBERT).
- Diffusion Models – Used for image generation (e.g., Stable Diffusion).
- RNNs/LSTMs – Sequential data modeling (less common now).
- Small-scale Models – Lighter models for edge devices (e.g., DistilBERT, Gemma).
- On-Device ML – TinyML, MobileNet, small Whisper variants.
Frameworks & Libraries
- PyTorch (open-source) – Most flexible, widely used for research and deployment.
- TensorFlow (open-source) – Industry adoption, supports mobile (TensorFlow Lite).
- JAX (open-source) – Optimized for high-performance computing.
- ONNX (open-source) – Standard format for cross-framework model execution.
- Jupyter Notebooks (open-source) – Interactive environment for prototyping, data exploration, and sharing experiments.
Model Execution & Optimization
- ONNX Runtime – Runs ONNX models efficiently on CPU/GPU.
- TorchScript – Converts PyTorch models to optimized bytecode.
- TensorRT (proprietary) – NVIDIA’s high-performance inference engine.
- GGUF (llama.cpp format) – Quantized model file format for efficient local LLM execution.
Platforms for Running Models Locally
- Hugging Face Transformers (open-source) – Load/run pre-trained models easily.
- Ollama (open-source) – Simplifies running local LLMs.
- LM Studio (proprietary, free) – GUI for running LLMs locally.
- Whisper.cpp/Llama.cpp (open-source) – Optimized for CPU-based execution.
- FastText (open-source) – Lightweight text classification/embedding.
The “Two Stages”
Machine learning workflows typically consist of two major phases:
1. Experimental Stage (Research & Prototyping)
- This is where models are built, tested, and refined.
- Often involves:
- Trying different architectures (CNNs, RNNs, Transformers, etc.).
- Hyperparameter tuning.
- Working with datasets interactively.
- Traditionally done in Python using PyTorch, TensorFlow, Jupyter Notebooks, etc.
- Optimized for flexibility rather than performance.
2. Deployment Stage (Production)
- Once a model is finalized, it needs to be deployed for inference.
- Requirements shift towards:
- Efficiency (low latency, high throughput).
- Scalability (runs efficiently on CPUs/GPUs in production).
- Typically, models are “rewritten” in C++ (mlpack, dlib) or optimized with TensorFlow Serving, ONNX Runtime, or TensorRT for performance.
How Models Can Be “Rewritten”
The term “rewriting” a model usually refers to optimizing or converting it from one framework to another for production use. This happens in several ways:
a. Manual Rewriting (Traditional)
- Data scientists prototype models in Python (PyTorch, TensorFlow).
- Engineers rewrite them in C++, Rust, or optimized C libraries for efficiency.
- Example:
- A TensorFlow model trained in Python is rewritten using C++ (dlib, mlpack) for deployment.
b. Model Conversion (Modern Approach)
- Instead of manually rewriting models, conversion tools allow them to be transformed:
- ONNX: Converts models from PyTorch/TensorFlow to ONNX, which can run efficiently with ONNX Runtime.
- TorchScript: Converts PyTorch models into a serialized, Python-free form for deployment.
- TensorRT: Optimizes TensorFlow/PyTorch models for NVIDIA GPUs.
- GGUF (Llama.cpp format): Optimizes LLMs for local execution.
Learning Plan: Mastering Embedding Deployment & Search in Go
🔹 Goal: Develop expertise in running, optimizing, and deploying various embedding models locally using Go.
🔹 Scope: Focus on deployment, inference optimization, and retrieval—not model training.
🟢 Step 1: API-Based Embeddings with Local Tokenization
✅ Goal: Learn embedding fundamentals by using a 3rd-party API (e.g., VoyageAI, OpenAI) while handling tokenization locally.
🔹 Key Concepts:
- How embeddings work & what they represent.
- Tokenization (splitting text into subwords) to control request size.
- Making efficient API calls in Go.
🔹 Tech Stack:
- Embedding API: VoyageAI or OpenAI
- Tokenizer: github.com/sugarme/tokenizer
🔹 Tasks:
- Install sugarme/tokenizer and test tokenization.
- Make an API call to VoyageAI/OpenAI for embeddings.
- Store embeddings in-memory or in a JSON file.
- Compute cosine similarity manually for a simple search function (see the sketch below).
✅ Outcome: Understand embeddings, API usage, and tokenization in Go.
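The tasks above boil down to one small program. Here is a minimal sketch, assuming OpenAI's /v1/embeddings endpoint and an OPENAI_API_KEY environment variable (VoyageAI follows the same workflow with its own endpoint and key); the model name and texts are illustrative:

```go
// embed_search.go – fetch embeddings from the OpenAI embeddings API
// and compare two texts by cosine similarity.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"math"
	"net/http"
	"os"
)

type embeddingsRequest struct {
	Model string   `json:"model"`
	Input []string `json:"input"`
}

type embeddingsResponse struct {
	Data []struct {
		Embedding []float64 `json:"embedding"`
	} `json:"data"`
}

// fetchEmbeddings sends one batched request and returns one vector per text.
func fetchEmbeddings(texts []string) ([][]float64, error) {
	body, _ := json.Marshal(embeddingsRequest{Model: "text-embedding-3-small", Input: texts})
	req, err := http.NewRequest("POST", "https://api.openai.com/v1/embeddings", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out embeddingsResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	vecs := make([][]float64, len(out.Data))
	for i, d := range out.Data {
		vecs[i] = d.Embedding
	}
	return vecs, nil
}

// cosine returns the cosine of the angle between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	vecs, err := fetchEmbeddings([]string{"the cat sat on the mat", "felines enjoy soft rugs"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("similarity: %.4f\n", cosine(vecs[0], vecs[1]))
}
```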
🟡 Step 2: Replace API Calls with Local Ollama Model
✅ Goal: Run embeddings locally using Ollama, removing external API dependencies.
🔹 Key Concepts:
- Running Ollama models on a local machine.
- Comparing local inference speed vs. API latency.
- Vector storage for search.
🔹 Tech Stack:
- Embeddings: Ollama (ollama run mxbai-embed-large)
- Vector Database: FAISS (github.com/DataIntelligenceCrew/go-faiss) or Qdrant (qdrant-client-go)
🔹 Tasks:
- Install and run an Ollama embedding model (mxbai-embed-large).
- Generate embeddings locally in Go via the Ollama API (see the sketch below).
- Store embeddings in FAISS or Qdrant for fast retrieval.
- Implement similarity search (cosine distance, k-NN).
✅ Outcome: Hands-on experience with local inference and vector DB integration.
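A sketch of the local-inference task, assuming Ollama is serving on its default port (11434) and exposes the /api/embeddings endpoint; check your Ollama version's docs if the endpoint or field names differ:

```go
// ollama_embed.go – generate an embedding via a locally running Ollama server.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type embedRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
}

type embedResponse struct {
	Embedding []float64 `json:"embedding"`
}

// embed posts a single text to the local Ollama server and returns its vector.
func embed(text string) ([]float64, error) {
	body, err := json.Marshal(embedRequest{Model: "mxbai-embed-large", Prompt: text})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post("http://localhost:11434/api/embeddings", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out embedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Embedding, nil
}

func main() {
	vec, err := embed("hello embeddings")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d dimensions\n", len(vec))
}
```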
🟠 Step 3: Optimize Embedding Model Execution
✅ Goal: Experiment with different ways to optimize inference performance.
🔹 Key Concepts:
- Quantization (reducing model size for speed).
- Multi-threading for parallel inference.
- Benchmarking different models.
🔹 Tech Stack:
- Optimized Ollama models (quantized .gguf versions).
- Profiling tools (pprof, Go benchmarking).
- Multi-threading in Go.
🔹 Tasks:
- Run different Ollama models (nomic-embed-text, mxbai-embed-large).
- Compare speed & accuracy of quantized vs. non-quantized models.
- Benchmark batch inference performance.
- Experiment with parallel embedding generation (see the sketch below).
✅ Outcome: Ability to fine-tune performance settings for embedding models.
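A stdlib-only sketch of the parallel-generation task using a bounded worker pool. embedOne is a placeholder for a real client call (e.g., the Ollama request from the Step 2 sketch); wrapping embedAll in a testing.B benchmark covers the batch-throughput task:

```go
// parallel_embed.go – fan embedding work out across a fixed pool of goroutines.
package main

import (
	"fmt"
	"sync"
)

// embedOne is a placeholder returning a dummy vector; swap in a real
// client call (e.g., the Ollama embed function from the Step 2 sketch).
func embedOne(text string) []float64 {
	return []float64{float64(len(text)), 0, 0}
}

// embedAll distributes texts over `workers` goroutines, preserving input order.
func embedAll(texts []string, workers int) [][]float64 {
	results := make([][]float64, len(texts))
	jobs := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = embedOne(texts[i])
			}
		}()
	}
	for i := range texts {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	vecs := embedAll([]string{"first", "second", "third"}, 2)
	fmt.Println("embedded", len(vecs), "texts")
}
```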
🔴 Step 4: Use ONNX Runtime for Faster Local Inference
✅ Goal: Run an ONNX-optimized embedding model using Go, bypassing Python overhead.
🔹 Key Concepts:
- ONNX format for running models across platforms.
- Optimized inference on CPU/GPU.
- Running ONNX models without Python.
🔹 Tech Stack:
- ONNX Runtime: github.com/yalue/onnxruntime_go
- ONNX Models: Download from Hugging Face (all-MiniLM-L6-v2.onnx).
🔹 Tasks:
- Install and load an ONNX model in Go.
- Generate embeddings using onnxruntime_go (see the sketch below).
- Compare performance vs. Ollama embeddings.
- Integrate ONNX-based embeddings into FAISS or Qdrant.
✅ Outcome: Expertise in running optimized ONNX models for embedding generation.
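A sketch of loading an ONNX embedding model with onnxruntime_go, modeled on the library's basic-usage pattern. The input/output names (input_ids, attention_mask, last_hidden_state), the 384-dim output, and the exact session API are assumptions that vary by model and library version; verify them against the library README and a model inspector like Netron:

```go
// onnx_embed.go – run an ONNX embedding model from Go without Python.
package main

import (
	"fmt"
	"log"

	ort "github.com/yalue/onnxruntime_go"
)

func main() {
	// Point the wrapper at the ONNX Runtime shared library for your platform
	// (.so on Linux, .dylib on macOS, .dll on Windows).
	ort.SetSharedLibraryPath("./onnxruntime.so")
	if err := ort.InitializeEnvironment(); err != nil {
		log.Fatal(err)
	}
	defer ort.DestroyEnvironment()

	// Toy inputs: real token IDs and attention mask come from a tokenizer.
	seqLen := int64(8)
	inputIDs, _ := ort.NewTensor(ort.NewShape(1, seqLen), make([]int64, seqLen))
	attention, _ := ort.NewTensor(ort.NewShape(1, seqLen), make([]int64, seqLen))
	defer inputIDs.Destroy()
	defer attention.Destroy()

	// Output: one 384-dim hidden state per token (all-MiniLM-L6-v2 sized).
	output, _ := ort.NewEmptyTensor[float32](ort.NewShape(1, seqLen, 384))
	defer output.Destroy()

	session, err := ort.NewAdvancedSession("all-MiniLM-L6-v2.onnx",
		[]string{"input_ids", "attention_mask"},
		[]string{"last_hidden_state"},
		[]ort.ArbitraryTensor{inputIDs, attention},
		[]ort.ArbitraryTensor{output}, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Destroy()

	if err := session.Run(); err != nil {
		log.Fatal(err)
	}
	// Mean-pool over tokens to get one sentence vector in a real pipeline.
	fmt.Println("output floats:", len(output.GetData()))
}
```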
📈 Summary: Building a Go-Based Embedding Powerhouse
| Step | Focus | Tech Used | Outcome |
|---|---|---|---|
| 1 | API-based embeddings & tokenization | VoyageAI/OpenAI, sugarme/tokenizer | Learn embeddings, tokenization, API integration |
| 2 | Local embedding inference | Ollama, FAISS/Qdrant | Run embeddings locally, store vectors, implement search |
| 3 | Optimization & performance tuning | Ollama, quantized models, Go benchmarking | Reduce inference latency, experiment with model settings |
| 4 | ONNX for high-performance inference | ONNX Runtime (onnxruntime_go) | Deploy highly efficient embedding models locally |
Enhancements & Optional Next Steps
🚀 Expand your skills further:
- Use embeddings in a real-world app (e.g., a CLI for searching a Git repo).
- Deploy as a web service (Go + Gin/Fiber; see the sketch below).
- Experiment with different vector DBs (Weaviate, Pinecone).
- Explore quantization & model compression further.
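As a starting point for the web-service idea, a stdlib-only sketch shows the shape of the service before reaching for Gin or Fiber. embedAndSearch is a hypothetical hook into whichever index was built in Steps 2–4:

```go
// server.go – expose embedding search over HTTP with only the standard library.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type searchRequest struct {
	Query string `json:"query"`
	TopK  int    `json:"top_k"`
}

type searchResult struct {
	Text  string  `json:"text"`
	Score float64 `json:"score"`
}

// embedAndSearch stands in for: embed the query, then query FAISS/Qdrant.
func embedAndSearch(query string, topK int) []searchResult {
	return []searchResult{{Text: "stub result for: " + query, Score: 1.0}}
}

func main() {
	http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
		var req searchRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(embedAndSearch(req.Query, req.TopK))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```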
Final Thoughts
✅ This learning plan is focused on real-world skills—deploying, running, and optimizing embedding models in Go.
✅ Enhancements allow deeper performance tuning.
✅ No unnecessary detours into training models—just practical embedding-based search.