Overview
AI and LLMs are all the rage these days. Incorporating them into your code can open up a number of possibilities. Relying on online resources / APIs has some issues though:
- your code may need to survive even if the online resource changes or is no longer available
- your code may be rate limited or blocked if thresholds are exceeded
- the token cost may accumulate much quicker than you were expecting. That $0.0012 process looks fine in testing, but once you go live you might find it runs millions of times in a short time frame. (Perhaps this is a good problem to have?)
- there may be legal issues with passing your data or intellectual property to an upstream provider who is not always clear about how that data may be used.
Sometimes running a LLM process locally is a better choice.
When considering local LLMs, you quickly discover Ollama. And you quickly discover that interacting with Ollama from code is not always quite the way the docs say it is. The docs and tutorials can become outdated in short order. That applies to this article as well: what works for me today may not work for you by the time you read this. And with that disclaimer out of the way, let us look at this.
For this article we will assume you have Ollama installed, running, and providing one or more models for use. You can run ollama list on your ollama server to see the list of available models. If you do not have a model, check Ollama’s Models Page, and then run ollama pull MODEL_NAME for your chosen model.
Know your Ollama Host
If your python code is not running on the same physical box as your Ollama server, you’ll need to indicate where to find your Ollama server. How you do this depends on which connection method you are using.
Environment Variable
In most cases setting an environment variable before running your code can resolve the connection issue:
export OLLAMA_HOST="192.168.123.123"  # Adjust the server's IP.

In some cases you can include the port number here as well, e.g. "192.168.123.123:11434".
Setting this only takes effect for the duration of the shell you are in. You may need to ensure this variable is set in a different manner if you need it to survive closing your shell or rebooting.
A better option might be to set this directly before calling your application:
OLLAMA_HOST="192.168.123.123:11434" python3 yourapplication.py

Make your code aware of the host
I find I often have to include other configuration variables, like a database host, port, and credentials. I usually use the python-dotenv package for this. (Or just create a config.py file, set my values there, and import it into my code as needed.)

Install the dotenv library with

pip install python-dotenv

Create or edit your .env file in your python project folder:

# /your/python/project/.env
OLLAMA_HOST="192.168.123.123:11434"

WARNING: if your .env file contains sensitive information like API keys, connection details, passwords, etc. then you should NOT include this file in your source control. The .env file is routinely excluded via .gitignore for these reasons.

Now call the load_dotenv() method as soon as you can in your python code. This will find your .env file, and create temporary environment variables you can then use in your code:

import os
from dotenv import load_dotenv

load_dotenv()
print(os.getenv('OLLAMA_HOST'))
I believe the second option is the more robust solution, but your specific needs will dictate which approach is most appropriate.
Connecting to Ollama with Python
There are two ways to connect to your Ollama server that I have found.
- Using the REST API,
- Via a python module such as the Ollama Library or LangChain
In either case, the same assumption from above applies: Ollama is installed, running, and providing one or more models for use.
The REST API approach
Using the REST API simply means calling a URL, passing along the information needed. To do so you’ll need to be able to issue a web request; in Python this is usually handled with the requests library.
# create an example file to work with, with these contents. i.e. ollama_test.py
# Make sure the file is in an environment that has requests and python-dotenv installed
import json
import requests
import os
from dotenv import load_dotenv
def ask_ollama(question, ollama_host="127.0.0.1:11434"):
    if not ollama_host:
        print('could not determine the OLLAMA_HOST - trying the local URL')
        ollama_host = "127.0.0.1:11434"
    # Create the URL string for the REST request
    ollama_url = f"http://{ollama_host}/api/generate"
    # define the prompt or question we are going to ask of the ollama server
    prompt = question
    # Configure the request parameters
    payload = {
        "model": "llama3.1",  # or whichever model you're using
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.2,
        }
    }
    # execute the HTTP request
    response = requests.post(ollama_url, json=payload)
    # let's see what we actually get back
    # print(response)
    # return the text part of the response
    return json.loads(response.text)['response']
# Assuming you have a .env file defined as described above
load_dotenv()
ollama_host = os.getenv('OLLAMA_HOST', '127.0.0.1:11434')
answer = ask_ollama("why is the sky blue", ollama_host)
print(answer)
And then we can call that file with
python ollama_test.py # use the file name you chose for the above code
When I run this on my server, using the llama3.1 model, I get the following response:
The sky appears blue to us because of a phenomenon called scattering, which occurs when sunlight interacts with the tiny molecules of gases in the atmosphere. Here's a simplified explanation:
**Scattering and Rayleigh's Law**
When sunlight enters Earth's atmosphere, it encounters tiny molecules of gases such as nitrogen (N2) and oxygen (O2). These molecules are much smaller than the wavelength of light, so they scatter the light in all directions.
The amount of scattering that occurs depends on the wavelength of the light. Shorter wavelengths (like blue and violet) are scattered more than longer wavelengths (like red and orange). This is known as Rayleigh's Law, named after the British physicist Lord Rayleigh who first described it in 1871.
**Why Blue Light is Scattered More**
Blue light has a shorter wavelength (around 450-495 nanometers) compared to other colors. As a result, blue light is scattered more by the tiny molecules of gases in the atmosphere. This scattering effect makes the sky appear blue to our eyes.
**Other Factors that Affect the Sky's Color**
While scattering is the main reason for the sky's blue color, there are other factors that can affect its appearance:
* **Time of Day**: During sunrise and sunset, the light has to travel through more of the atmosphere, which scatters the shorter wavelengths (like blue) even more. This is why the sky often appears redder during these times.
* **Atmospheric Conditions**: Dust, pollution, and water vapor in the air can scatter light in different ways, making the sky appear hazy or grayish.
* **Earth's Atmosphere**: The atmosphere itself scatters light, but it also absorbs some of the longer wavelengths (like red), which is why we don't see as much red light from space.
So, to summarize: the sky appears blue because of scattering by tiny molecules in the atmosphere, with shorter wavelengths like blue being scattered more than longer wavelengths.
Yay! Too much information!
If you use a different model, or omit the temperature setting, you may see a different result.
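For example, you could swap the payload inside ask_ollama() for something like this. The model name below is only an illustration; substitute whatever ollama list reports on your server:

# a variant payload for the REST example above
payload = {
    "model": "mistral",          # an example only - use a model you have pulled
    "prompt": "why is the sky blue",
    "stream": False,
    # no "options" block, so the model's default temperature applies
}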
While this was a success and achieved our overall goal of local LLM use, there are some downsides we will discuss below.
The Python Library approach
Using the python library method is not quite as straightforward. Each library has its own quirks and function definitions. These are not always consistent, and can lead to some frustration as you dig deeper to understand the specifics of the library you are trying to use.
I’m using the ollama library for our examples here. My experience with LangChain has not always been a successful endeavor. But keep in mind the basic concepts we go over also apply to LangChain and other libraries.
import os
from dotenv import load_dotenv
import ollama
def ask_ollama(question, ollama_host="127.0.0.1:11434"):
    # Create the base URL string for the client connection
    ollama_url = f"http://{ollama_host}"
    # Create a client object we will use to connect to our server
    client = ollama.Client(host=ollama_url)
    # Then we can call the appropriate method on that client.
    # Here we call .generate() as it is closest to the same process we did with the REST approach.
    response = client.generate(
        model='llama3.1',
        prompt=question)
    # Notice the response object is different than in the REST API method.
    return response["response"]
# Assuming you have a .env file defined as described above
load_dotenv()
ollama_host = os.getenv('OLLAMA_HOST', '127.0.0.1:11434')
if not ollama_host:
    print('could not determine the OLLAMA_HOST - trying the local URL')
    ollama_host = "127.0.0.1:11434"
answer = ask_ollama("why is the sky blue", ollama_host)
print(answer)
The difference from the REST request approach is that the details of interacting with the Ollama system and its responses are understood by the ollama object. So the object wraps up some of the coding effort and makes life easier for us.
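As an aside, the same client object wraps other interaction styles too. Here is a minimal chat-style sketch, assuming the method and response shape match what my install of the library provides; the host and model are illustrative only:

import ollama

# A chat-style sketch; adjust the host and model for your own setup.
client = ollama.Client(host="http://127.0.0.1:11434")
response = client.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'why is the sky blue'}],
)
print(response['message']['content'])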
HOW you specify the Ollama server is highly dependent on the library you are using. In some cases, like above, a connection object is needed where the server details can be specified, and then the connection object is used to interact with the server via prompts/questions. In other cases, the base_url or host parameter can be specified with every request/command method. In some cases a specific environment variable may be needed. You will need to understand the library you are choosing, which should be in the documentation for that library.
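For example, with the langchain-ollama package (at least as it behaved when I last tried it; treat the parameter names as assumptions and check its docs), the host is passed as base_url when constructing the model object:

from langchain_ollama import OllamaLLM

# base_url points at your Ollama server; model is whatever `ollama list` shows
llm = OllamaLLM(model="llama3.1", base_url="http://192.168.123.123:11434")
print(llm.invoke("why is the sky blue"))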
Downsides to hosting locally
Both connection methods mentioned above will suffer from the same negative conditions:
performance - the simple example question above saw my CPU utilization pegged at 100% across all my cores for approximately 60 seconds. This is a direct result of:
- The llama3.1 model. It is a 4GB file that has to be fully loaded into memory and processed for every request. This takes some time. Using a smaller model will be more performant, but may not give results as good as the larger model.
- My local hardware. The better the hardware the Ollama server runs on, the more performant the results will be. In my case, I happen to be on a Ryzen 7 CPU with 16GB RAM and an Nvidia GTX1650 graphics card. My graphics card is a base model with GPU support, not as performant as a much newer card. Your performance will likely be better on more modern hardware.
- The configuration of the Ollama server. When installing Ollama it attempts to find GPU support and configures this if available. Otherwise Ollama is installed with CPU support only. If that were the case for me, that one minute request might have taken 5 minutes instead, if the process didn’t just crash.
consistency - running the same questions multiple times may not give you the same results. Adjusting the temperature setting, and perhaps specifying a “seed” value alongside it (see the sketch after this list), can help but is not always a guarantee you’ll get a consistent result. The good news is that this is NOT an Ollama specific issue, or even a model specific issue. Some models may be more consistent than others, but my experience thus far is that they all suffer this limitation. (Disclaimer: my own experience should not in any way be defined as “the standard” to apply.)
dependability - I find that if I work my local Ollama server too hard, things tend to become unstable. In some cases my computer freezes until the processing is done; in other cases my process may crash and take out the shell or editor window I’m working in. This is more noticeable for the memory intensive tasks I might run. The problem is that some question and model combinations may not appear to be memory intensive until you are in the midst of processing.
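To illustrate the consistency point above, here is a minimal sketch of pinning a seed and lowering the temperature with the ollama library. The option names reflect my current install and are not a guarantee of repeatable output:

import ollama

# Nudging Ollama toward repeatable answers - not a guarantee.
client = ollama.Client(host="http://127.0.0.1:11434")
response = client.generate(
    model='llama3.1',
    prompt='why is the sky blue',
    options={
        "temperature": 0,  # less randomness
        "seed": 42,        # the same seed should steer toward the same output
    },
)
print(response["response"])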
Whether or not the performance issue is a concern depends on your task. There are many tasks where delays just don’t matter. This is not an approach I would recommend for direct public website access, where 3 seconds might be too long for a response. The issue may be mitigated through creative use of caching and other techniques though.
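As one example of that mitigation, here is a sketch of a very simple in-memory cache around the ask_ollama() function from earlier. A real application might persist this to disk or a database instead:

# A minimal in-memory cache around the earlier ask_ollama() function.
_answer_cache = {}

def ask_ollama_cached(question, ollama_host="127.0.0.1:11434"):
    # only hit the model for questions we have not already answered
    if question not in _answer_cache:
        _answer_cache[question] = ask_ollama(question, ollama_host)
    return _answer_cache[question]

# the second call returns instantly from the cache
print(ask_ollama_cached("why is the sky blue"))
print(ask_ollama_cached("why is the sky blue"))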
Conclusion
Using a local Ollama server offers a number of potential opportunities for some interesting code. Especially if you are looking into Retrieval Augmented Generation (RAG) where you want to be able to ask questions of your internal documentation.
Just like any other technology though, you need to be conscious of the impacts and costs involved.
For myself, I have no hesitation using an external tool like OpenAI, Google Gemini, Anthropic’s Claude, or Meta AI’s Llama models - I can accept the risks for myself. But if a customer asks me to use AI or LLMs to generate content for them, I would set up an internal system that does not expose customer data to external entities - not unless I have a written document from them indicating this is OK and indemnifying me against the liabilities.
The above shows two methods you might use to talk to an Ollama server to achieve that customer-focused effort.