A Welcoming Introduction to Linux Environment Variables

“Environment variables” might sound intimidating to people who are new to Linux, but they are actually quite a simple tool that unlocks more power and flexibility when we run programs. Here I want to summarize my understanding of them, and hopefully help people who are stumbling through the process of learning Linux, like I did.

Note: Most of what I talk about in this article should also apply to other Unix-like operating systems, but I’ll refer to Linux throughout, which is itself a Unix-like system. You can read more about the relationship and difference between Unix and Linux here and here.

Another note: I try to be precise in choosing the terminology but I might be using command line, shell session, terminal interchangeably. Here’s another post in which I clarify the differences.

What’s the Environment, and What’s an Environment Variable?

The environment in Linux is an abstract concept; there’s no physical space or directory that represents the “environment”. Instead, the environment is simply a collection of states that control how our computer should behave. The environment variables are the switches/knobs we use to change those states.

Nearly everything we do on our Linux machine is done by executing programs, either via a GUI (graphical user interface) or a CLI (command-line interface). For example, you might have been opening your VS Code editor by clicking its icon on your desktop, but we can also achieve that by executing:

dian@ubuntu % code

which also brings up a new VS Code window.

Usually, we can tell a program what to do or give it extra information by specifying command-line options. For example, if we use g++ (or clang++) to compile a C++ executable, we can tell it where to find the header files and libraries by specifying the -I and -L options, respectively:

dian@ubuntu % g++ main.cpp -I<our-include-dir> -L<our-library-dir> ...<other options>

Another way to do this is to use the CPLUS_INCLUDE_PATH and LIBRARY_PATH environment variables. By setting CPLUS_INCLUDE_PATH=<our-include-dir> and LIBRARY_PATH=<our-library-dir> beforehand, we save quite some typing every time we run the compilation:

dian@ubuntu % g++ main.cpp ...<other options>

because g++ is aware of these environment variables and will read them upon execution to get the paths. (As a side note, if we have set the environment variables but still want to be flexible during each execution, usually the program will allow us to override or complement the settings by using command line options, which have higher priority.)

In short, the environment is a collection of states/settings from which our programs can read the information they need to do their job. We can change the settings by setting environment variables. Different programs look for environment variables specific to their own purpose, which are determined when the programs themselves are written. In the case of g++, it looks for CPLUS_INCLUDE_PATH and LIBRARY_PATH to get the include paths and library paths. Environment variables are thus another way to pass information to a program for its execution, apart from command-line options.
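To make the idea concrete, here is a minimal Python sketch of how a program might combine the two sources of information. It is only an illustration, not how g++ is actually implemented: the function name resolve_include_dirs and the variable MY_INCLUDE_PATH are made up for this example. Directories given with -I on the command line come first, followed by those listed in the environment variable.

```python
import argparse


def resolve_include_dirs(argv, environ):
    """Return include dirs: -I options first, then MY_INCLUDE_PATH entries.

    MY_INCLUDE_PATH is a hypothetical variable used only in this sketch.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("-I", dest="include_dirs", action="append")
    # Anything that is not -I (e.g. the source file name) is ignored here.
    args, _others = parser.parse_known_args(argv)
    from_command_line = args.include_dirs or []

    # Like PATH, the value is a colon-separated list of directories.
    env_value = environ.get("MY_INCLUDE_PATH", "")
    from_environment = [d for d in env_value.split(":") if d]

    return from_command_line + from_environment


fake_environ = {"MY_INCLUDE_PATH": "/usr/local/include:/home/dian/include"}
print(resolve_include_dirs(["-I/opt/include", "main.cpp"], fake_environ))
print(resolve_include_dirs(["main.cpp"], fake_environ))
```

Passing the environment in as a plain dictionary keeps the sketch easy to try out; a real program would typically pass os.environ instead.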

Some Common Environment Variables

On most Linux distributions, there are some common environment variables that we can get upon booting. Just to list a few:

  • HOME: the home directory, e.g., /home/dian
  • USER: the current username, e.g., dian
  • SHELL: the current shell program, e.g., /bin/zsh
  • PWD: the current working directory, e.g., /home/dian/<wherever-i-navigate>/
  • LANG: the current locale (language and character encoding), e.g., en_US.UTF-8

Among these, HOME and PWD are probably the more frequently used ones when we develop our scripts. We can show the value of an environment variable by calling echo $VAR in the shell command line, with a dollar sign $ in front of the variable name:

dian@ubuntu % echo $HOME
/home/dian
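Programs read these values in much the same way. As a small sketch in Python (the helper name describe_common_variables is just for illustration), we can look up the variables listed above; any variable that isn’t set on a given machine simply comes back as None:

```python
import os


def describe_common_variables(environ=None):
    """Look up a few common environment variables by name."""
    if environ is None:
        environ = os.environ  # the environment of the current process
    names = ["HOME", "USER", "SHELL", "PWD", "LANG"]
    # .get() returns None instead of raising an error when a name is missing.
    return {name: environ.get(name) for name in names}


for name, value in describe_common_variables().items():
    print(f"{name}={value}")
```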

Notice that these variables are present upon booting, or more precisely, as soon as we start a terminal session. Next, we are going to discuss how environment variables are set in general, and why these variables are already here.

How to Set and Use Environment Variables?

Shell Variable vs. Environment Variable

Most of the time, the way we interact with the Linux environment is via the shell command line, or terminal. So the easiest way to set a variable is to directly assign a value to it:

dian@ubuntu % MY_NUM=233    # notice we cannot use spaces around the =
dian@ubuntu % echo $MY_NUM
233

No magic syntax, and we can print it out using echo right afterwards. But so far MY_NUM is still a shell variable, not an environment variable yet. This means the current shell can make use of it, e.g., print its value, but the child processes started from this shell have no access to it. Let’s see this in effect using Python’s os.getenv function, a common utility that retrieves an environment variable by name:

dian@ubuntu % python
Python 3.8.3 (default, May 19 2020, 13:54:14) 
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.getenv("MY_NUM")
>>>

The fact that os.getenv("MY_NUM") prints nothing (it returns None) shows that the python child process indeed didn’t get access to the newly defined MY_NUM. To make it an environment variable that child processes can see, we need to use the export command to “export” the shell variable into the environment:

dian@ubuntu % export MY_NUM
dian@ubuntu % echo $MY_NUM
233
dian@ubuntu % zsh   # this starts a child zsh shell from the current shell
dian@ubuntu % echo $MY_NUM
233
dian@ubuntu %

After being exported with the export command, MY_NUM becomes available to child processes. We’ve now seen the first way of defining an environment variable: define it as a shell variable, then export it. We can also combine the two steps in one line:

dian@ubuntu % export MY_STRING=hello
dian@ubuntu % echo $MY_STRING
hello
dian@ubuntu %
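The difference between a plain shell variable and an exported one can also be checked in a scripted way. The sketch below is only an illustration under a few assumptions: it drives sh from Python (assuming sh is available on the machine), the helper name child_sees is made up, and the square brackets are there just to make an empty value visible.

```python
import os
import subprocess


def child_sees(script):
    """Run `script` in a fresh sh and return whatever it prints."""
    # Start from an environment without MY_NUM so the demo is repeatable.
    clean_env = {k: v for k, v in os.environ.items() if k != "MY_NUM"}
    result = subprocess.run(
        ["sh", "-c", script],
        env=clean_env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# The inner `sh -c ...` is a child process of the outer shell. The single
# quotes stop the outer shell from expanding $MY_NUM on its own.
INNER = "sh -c 'echo \"[$MY_NUM]\"'"

not_exported = child_sees("MY_NUM=233; " + INNER)
exported = child_sees("MY_NUM=233; export MY_NUM; " + INNER)

print(not_exported)  # [] : the child shell sees no value
print(exported)      # [233] : the child shell inherits the exported value
```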

Let’s see two other ways of defining environment variables and using them in the child processes:

Prepend the Definition Before the Command

A handy, one-time use of an environment variable is to define it on the same line, before the command we are going to run. This way the environment variable is effective only for this execution, and it overrides the existing value (if any) with the one we provide here, but only this time. For example:

dian@ubuntu % MY_NUM=23333 python -c "import os; print(os.getenv('MY_NUM'))"
23333   # MY_NUM was overridden
dian@ubuntu % echo $MY_NUM
233     # MY_NUM is back to 233
dian@ubuntu %

As we can see, the MY_NUM value specified in front of the one-liner python command overrides the existing one with 23333, but only for that command. After the python sub-process finishes, MY_NUM in the shell still holds its own value, which is 233.

A common usage among deep learning practitioners is to specify the GPUs to use in each run of the experiments, e.g.:

dian@ubuntu % CUDA_VISIBLE_DEVICES=2,3 python train.py
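The same one-off override can be done when launching a program from Python, which may help clarify what the shell does for us here. This is a minimal sketch, not the shell’s actual implementation: the helper name run_with_env is made up, and the child is simply another Python interpreter. The child receives a copy of the current environment with our overrides applied, while the parent’s own environment stays untouched.

```python
import os
import subprocess
import sys


def run_with_env(overrides):
    """Run a child Python with extra variables and return the MY_NUM it sees."""
    # Copy the current environment, then apply the one-off overrides.
    child_env = dict(os.environ, **overrides)
    code = "import os; print(os.getenv('MY_NUM'))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=child_env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


os.environ["MY_NUM"] = "233"                # like an exported MY_NUM=233
print(run_with_env({"MY_NUM": "23333"}))    # 23333 : overridden for the child
print(os.environ["MY_NUM"])                 # 233 : the parent is unchanged
```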

Use Source Scripts

Instead of remembering and typing all the environment variables every time we execute a command, we can group the ones we need into a shell script, for example:

# my_source.sh
export CUDA_VISIBLE_DEVICES=2,3
export DATA_PATH=$HOME/data:$PATH

And before executing the program, we “source” this script first to get all the variables exported to current shell environment by calling source:

dian@ubuntu % source my_source.sh
dian@ubuntu % echo $CUDA_VISIBLE_DEVICES
2,3
dian@ubuntu % python -c "import os; print(os.getenv('CUDA_VISIBLE_DEVICES'))"
2,3
dian@ubuntu %

This is as if we’ve called those two lines of definition in the shell in-place:

dian@ubuntu % export CUDA_VISIBLE_DEVICES=2,3
dian@ubuntu % export DATA_PATH=$HOME/data:$PATH

The source command is worth some detailed discussion. The my_source.sh in this example is no more than a normal shell script, so if we directly execute it without using source like this:

dian@ubuntu % chmod +x my_source.sh
dian@ubuntu % ./my_source.sh

which is roughly equivalent to running it in a new shell this way:

dian@ubuntu % zsh my_source.sh      # export happens in the child shell

and then try to print out the environment variables:

dian@ubuntu % echo $CUDA_VISIBLE_DEVICES

dian@ubuntu % 

We can see that the environment variables have no effect after the child shell finishes executing my_source.sh. This is because the new shell is invoked from the current shell, which makes it a child process of the current shell, so everything it defines is only effective for itself and its own child processes. Even though we’ve used export in front of the variables, they are exported to the child shell’s environment and are gone when the child shell terminates.

So, the effect of using source (and its equivalent, the . dot command) is to execute a script within the current shell context without creating a child shell. Together with export, all variables defined in the script can be exported into the current shell environment, as if the commands were typed into the current shell line by line.

Don’t forget to use export; otherwise the variables won’t be exported to the environment either, and they will end up being local shell variables.
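The rule that a child process cannot change its parent’s environment is not specific to shells. As a rough sketch (the helper name run_child_that_sets and the variable DEMO_VAR are made up for this example), we can let a child Python process set a variable in its own environment and then check that the parent never sees it:

```python
import os
import subprocess
import sys


def run_child_that_sets(name, value):
    """Start a child that sets a variable for itself; return what it prints."""
    code = (
        "import os, sys\n"
        "os.environ[sys.argv[1]] = sys.argv[2]\n"  # only changes the child's copy
        "print(os.environ[sys.argv[1]])\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, name, value],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


os.environ.pop("DEMO_VAR", None)          # make sure we start without it
seen_by_child = run_child_that_sets("DEMO_VAR", "2,3")
seen_by_parent = os.environ.get("DEMO_VAR")

print(seen_by_child)   # 2,3 : set inside the child
print(seen_by_parent)  # None : the parent's environment is untouched
```

This is the same situation as running ./my_source.sh directly: the definitions live and die with the child process.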

Lifespan of Environment Variables and $HOME Revisited

So far, we’ve talked about the three main ways of defining environment variables in a shell session. If we end the shell session by calling exit or simply closing the terminal window, will we see them again the next time we open a shell session? The answer is no. The lifespan of an environment variable is confined to the shell session where it’s defined.

Wait, then why can we see environment variables like HOME each time as soon as we log in without doing anything?

The answer lies in something similar to source. Each time we start a shell session, the shell automatically executes a series of scripts behind the scenes, called startup files, before it presents us with a prompt and gets ready to receive input. Some startup files are executed in a “source”-d way. These startup files contain many pre-defined variables (and functions), which are exported to the environment and become environment variables of the shell. After this pre-processing, we have our default environment. Strictly speaking, a few basics such as HOME, USER and SHELL are usually set even earlier, by the login program that starts our shell, and the shell simply inherits them like any other child process would; either way, something has set them up for us before we see the prompt.

Depending on the operating system (e.g., Ubuntu, CentOS, OS X), the shell program (e.g., bash, zsh) and the type of shell (e.g., login vs. non-login, interactive vs. non-interactive, see this post), a different series of startup files is executed when a shell session opens, from system-wide configuration to user-level configuration. Some common user-level startup files are ~/.bash_profile and ~/.bashrc if we use bash, and ~/.zshrc if we use zsh. Usually we put our own variables and helper scripts in the user-level startup files, while system-wide defaults are more likely to be found in the system-level startup files.
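Whatever ends up in the shell’s environment is then handed down by plain inheritance: every process started from the shell gets a copy, and so does every process started from those. The sketch below illustrates this with two generations of Python processes; the helper name seen_by_grandchild and the variable DEMO_INHERITED are made up for the example.

```python
import os
import subprocess
import sys


def seen_by_grandchild(name):
    """Start a child that starts a grandchild; return what the grandchild sees."""
    grandchild_code = "import os; print(os.getenv(%r))" % name
    child_code = (
        "import subprocess, sys; "
        "subprocess.run([sys.executable, '-c', %r], check=True)" % grandchild_code
    )
    # No env= argument: each process passes on a copy of its own environment.
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


os.environ["DEMO_INHERITED"] = "hello"       # set once, in the "top" process
print(seen_by_grandchild("DEMO_INHERITED"))  # hello
```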

To Export or Not to Export, That’s the Question

When should we export shell variables as environment variables? Generally speaking, we use export when we need to pass the variables to the child processes executed from the current shell, i.e., when the child processes need access to the variables. Some examples include:

  • Configure the system-wide & user-level environments using startup files
  • Configure the paths, flags, optimization options, etc., of compilers (such as g++, clang++)
  • Similarly, configure the build systems that care about paths, flags and optimization options, etc. (such as CMake)
  • Mask GPU visibility when running deep learning experiments
  • Any other programs that make use of environment variables, even a one-liner python command like python -c "import os; print(os.getenv('MY_NUM'))"

On the contrary, if the variables aren’t needed in the child processes and they are more for the purpose of the current shell’s execution, then it would be better if we keep them as just shell variables. Typical usage includes (as a side note, most of these use cases will happen in shell scripting since shell variables are for shells!):

  • Flags that control the execution flow of the script (such as if-else, while)
  • Similarly, loop counters during scripting (such as for, while)
  • Convenient containers storing the message for display

Although there are no strict rules that forbid us from doing one or another, it’s better to adopt this practice. Otherwise we might pollute our environment with unnecessary variables which might look confusing and even cause unintentional bugs.

Summary

In this article we’ve briefly introduced what Linux environment variables are, and why and how we use them. This topic touches many other related topics, so we can’t exhaust every concept in a single article. I’ve sprinkled links to related articles here and there, so feel free to navigate to the ones you find interesting.

A TLDR summary:

  • The Linux environment is a collection of states that control how our machine should behave.
  • We can change the states by setting environment variables which can be used to pass information to the program we execute.
  • There are three major ways to set environment variables:
    • exporting existing shell variables using export,
    • prepending one-time variables in front of a command, and
    • sourcing a script which contains a bunch of exports.
  • The lifespan of an environment variable is confined to the shell session where it’s defined.
  • There are a bunch of startup files that prepare a default environment upon each shell session instantiation.
  • Generally speaking, it’s better to use export when we need the variable to be accessible by child processes; otherwise, it’s better to keep it as a shell variable.