Hosting on prem LLM (SSC)

 
  • Hengjian Zhang - AI application support engineer
  • Nvidia ai blueprint
  • monitoring
    • zipkin, grafana
    • prometheus
  • RAG : NeMO retriever + paddle ocr
  • Milvus vector DB
  • Why
    • uva has a chatbot already (so we should as well ;p)
  • Advantages
    • monitoring and safety control
    • NVIDIA inference microservice (higher throughput)
    • “dont need to maintain software” (aka someone else does it for you)
  • spike - license obtained by tue
    • its an AI supercomputer (DGX B200)
  • nvidia/llama-3.3-nemotron-super-49b-v1
    • send data to tue server instead of openai being the only selling point
  • at the moment
    • deployed on a vm
    • no dynamic scaling yet
    • no money to do this atm so doesnt “really” exist
      • if demand, then the tue will spend money on it
    • has guardrails but that requires even more GPU resources
  • SURF
    • self hosted GPUs are not okay
      • so azure is to be used
    • 4xA100 but not enough
    • edu genai is a service by them
    • spike-1 is not too stable rn
  • supercomputing@tue.nl
  • IMO : kinda sucks right now (I guess) (or not????)