Changing String Variables to Categorical Variables and Vice Versa
Sometimes string variables need to be categorical and categorical variables need to be strings. In Stata this comes up often because Stata cannot use string variables in most analyses and treats their values as missing. Anticipating this problem, Stata offers several commands to change string variables into categorical variables and vice versa.
The first case most often occurs when importing data from another source. Sometimes Stata reads a categorical variable in as a string variable. The easiest way to tell if this is the case is to look at the variable's storage type, which Stata shows among the variable's properties and in the output of describe. If a variable is a string, the type will be str followed by some number. If, for example, you had a gender variable consisting of ones and zeroes that was stored as str1, so that every value is a number held as text, you could use the destring command. If you want to replace the existing variable, the command is simply
destring [varname], replace
This will replace the specified variable with the same data, now stored in a numeric format. If you prefer to retain the existing variable, you can generate a new variable that is a numeric version of the existing variable. To do this type
generate [new variable name]=real([string])
In my example, this would look like
generate sex2=real(sex)
This command would create a new variable called sex2 that contained the numeric data from my original variable (sex) stored in a numeric format.
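The two approaches above can be tried side by side on a small made-up dataset. This is only a sketch: the three observations are invented for illustration, and sex and sex2 are the variable names from my example.

```stata
* Hypothetical toy data: sex stored as the strings "0" and "1"
clear
input str1 sex
"1"
"0"
"1"
end

* The storage type is reported as str1
describe sex

* Option 1: keep the original and create a numeric copy
generate sex2 = real(sex)

* Option 2: convert the original variable in place
destring sex, replace

* Both variables are now numeric
describe sex sex2
```

Note that real() returns a missing value for any string it cannot read as a number, so it is worth checking the new variable for unexpected missing values.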
Both of these commands have a reverse. In the first case,
tostring [varname], replace
will convert the variable back to a string, and
generate name=string([numeric variable])
will generate a new string variable with the same data as the numeric variable specified, but not saved in a numeric format.
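Going in the other direction looks much the same. The sketch below again uses invented data; sex_str is a hypothetical name for the new string variable.

```stata
* Hypothetical toy data: sex stored as a numeric variable
clear
input byte sex
1
0
1
end

* Option 1: keep the numeric original and create a string copy
generate sex_str = string(sex)

* Option 2: convert the original variable in place
tostring sex, replace

* Both variables are now strings
describe sex sex_str
```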
The above will only work if all of the data is numeric. Sometimes it is not. When your string variables really do contain text (e.g., "female" instead of "1"), you have to tell Stata to encode the string data. Running the encode command causes Stata to make a new numeric categorical variable whose value labels correspond to the old string values.
If you do this, be aware that Stata is case sensitive; female, Female and FEmale will be treated as three different categories.
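One way to guard against this, which is my suggestion rather than a required step, is to convert every value to lowercase before encoding. The sketch below uses invented data and the lower() string function.

```stata
* Hypothetical toy data with inconsistent capitalization
clear
input str6 gender
"female"
"Female"
"FEmale"
"male"
end

* Before cleaning, tabulate reports four distinct values
tabulate gender

* Convert every value to lowercase so the spellings collapse
replace gender = lower(gender)

* After cleaning, only "female" and "male" remain
tabulate gender
```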
The encode command is slightly more complicated, requiring an option,
generate([newvariablename])
Continuing the gender example, the full command would look something like this
encode gender, generate(sex)
This would cause Stata to generate a new variable called "sex" that contains numeric categories based on the old variable (called "gender"). However, if you browse the new variable it will look the same, because Stata displays the labels (not the raw numbers). The only visual clue that something is different is that the text will now be blue instead of the red that Stata uses for strings. The opposite of encode is decode. The decode command has the same syntax as the encode command, but generates a string variable based on the labels of a numeric categorical variable.
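A short round trip shows both commands together. The data are invented, and gender2 is a hypothetical name for the decoded variable.

```stata
* Hypothetical toy data: gender stored as text
clear
input str6 gender
"female"
"male"
"female"
end

* Create a labeled numeric variable from the string values
* (codes are assigned in alphabetical order, so female is 1 and male is 2)
encode gender, generate(sex)

* Show the underlying numbers instead of the labels
list gender sex, nolabel

* Recover a string variable from the value labels
decode sex, generate(gender2)
list gender sex gender2
```

The nolabel option on list is a quick way to see the raw numbers that the labels are hiding.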
The most complicated cases are those in which you import data with numeric and nonnumeric characters. Google Books offers some useful information on the subject.
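For the simplest of these mixed cases, where the nonnumeric characters are just formatting, the ignore() option of destring may be enough. The sketch below assumes a hypothetical price variable that was imported with thousands separators.

```stata
* Hypothetical toy data: numbers imported with commas
clear
input str8 price
"1,200"
"950"
"10,000"
end

* ignore() lists the characters to strip before converting
destring price, replace ignore(",")

list price
```

This only helps when the extra characters carry no information; values that mix real text with numbers still need to be cleaned by hand first.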