Click
on a cell (i.e. A10), hold the shift key,
click
on
another cell (C15) to select the cells between A10 and C15.
can also click on a cell and drag the mouse to the desire range
select not-adjacent cells, click on a cell, press ctrl and select another cell
or range of cells.
Excel
stores your work in a workbook, each workbook has one or more worksheets
(and/or charts) which you can view by clicking on the sheet tab (lower left corner
of the active (current) sheet).
Entering
can type anything on a cell, in general you can enter text (or labels),
numbers, formulas (starting with the "=" sign), and logical values (as in
"true" or "false").
Click
on a cell and start typing, once you finish typing press "enter" (to move to
the next cell below) or "tab" (to move to the next cell to the right)
can write long sentences in one single cell but you may see it partially
depending on the column width of the cell (and whether the adjacent column is
full). To adjust the width of a column go to Format -- Column -- Width or select
"AutoFit Selection".
Numbers
are assumed to be positive, if you need to enter a negative value use the minus
sign ("-") or enclose the number in parentheses ("(number)").
you need to enter percentages, dollar sign, or any other symbol to identify the
number just add the "%" or "$". You can also enter the number and change its
format using the menu: Format -- Cell and select the "number" tab which has all
the different formats.
Dates
are automatically stored as mm/
dd
/
yyyy
(or the default format if changed) but there is some flexibility here. Enter
month and number and excel will enter the date in the default format. If you
press "ctrl" and ";" (
Crtl
-;) excel will enter the
current date.
is also entered in a default format. Enter "5 pm", excel will write "5:00 PM".
To enter the current time press "ctrl" and ":" (Ctrl-:)
practice enter the following table (these data are made-up, not real)
Let"s
say for example you do not want
capital letters for the columns "Last Name" and "First Name". You do not want
"SMITH" you want "Smith". Two options, you can re-type all the names or you can
use the following formula
(
IMPORTANT
: All formulas start
with the equal "=" sign)
:
=
PROPER(
cell with the text you want to change)
can start answering some of these questions by looking directly at the table,
for some other questions you may have to do some calculations by obtaining a
set of
descriptive statistics
. These statistics are a collection of
measurements of two things:
location
and
variability
. Location
tells you the central value (the mean is the most common measure of this) of
your variables. Variability refers to the spread of the data from the center
value (i.e. variance, standard deviation). Statistics is basically the study of
what causes such variability.
Let"s
get some descriptive statistics for this data. In excel go to Tools -- Data
Analysis. If you do not see "data analysis" option you need to install it, go
to Tools -- Add-Ins, a window will pop-up and check the "Analysis
ToolPack
" option, then press OK. Try running data analysis
again.
For Excel 2007 see
http://
office.microsoft.com/en-us/excel/HP100215691033.aspx
For Excel 2003 see
http://
office.microsoft.com/en-us/excel/HP011277241033.aspx
Since
we include the
labels in first row
make sure to check that option. For
the
output option
which is the place where excel will enter the results
select O1 or you can select a new worksheet or even new workbook.
Check
"Summary statistics" and the press OK. You will get the following:
While
the whole descriptive statistics cells are selected go to Format--Cells to
change all numbers to have one decimal point. When you get the "format cells"
window, select the following:
average student in this sample is 25.2 years, has a SAT score of 1848.9, got a
grade of 80.4, is 66.4 inches tall and reads the newspaper 4.9 times a week. We
know this by looking at the "m
ean"
value on each variable.
mean
is the sum of the observations divided by the total number of
observations. It is the most common indicator of central tendency of a
variable. If you look at the last two rows: "Sum" and "Count" you can estimate
the mean dividing "Sum" by "Count" (sum/count). You can also calculate the mean
using the function below
(
IMPORTANT
: All functions start
with the equal "=" sign):
"
Sum
" refers to the sum of all the values in a range of
values. For age means the sum of the ages of all students. The excel function
for sum is:
The
"
Standard
Error
" (SE) indicates how close the sample mean is from the "true"
population mean. The average age of 25.2 years is just an estimate of this
sample of students but it can vary had you used a different set of students.
The standard error is calculated by dividing the standard deviation of the
population (or the sample) by the square root of the total number of
observations. The SE can be used to roughly define a range of certainty for the
mean. Using "age":
median
is another measure of central tendency. To get
the median you have to order the data from lowest to highest. The median is the
number in the middle.
If the number of
cases is odd the median is the single value, for an even number of cases the
median is the average of the two numbers in the middle. The excel function is:
mode
refers to the most frequent, repeated or common
number in the data. By age there are more students 19 years old in the sample
than any other group. In the SAT scores the mode is "#N/A" which means that all
values are unique. The excel function is:
sample
variance
measures the dispersion of
the data from the mean. It is the simple mean of the squared distance from the
mean. It is calculated by:
= sum of (X-mean of X)
2
/ Number of observation minus 1
Higher
variance means more dispersion from the mean.
The excel function is:
standard
deviation
is the squared root of the
variance. Indicates how close the data is to the mean. Assuming a normal
distribution, 68% of the values are within 1
sd
from
the mean, 95% within 2
sd
and 99% within 3 sd. The
excel formula is:
Skewness
measures the asymmetry of the data,
when in an otherwise normal curve one of the tails is longer than the other. It
is a roughly test for normality in the data (by dividing it by the SE). If it
is positive there is more data on the left side of the curve (right skewed, the
median and the mode are lower than the mean). A negative value indicates that
the mass of the data is concentrated on the right of the curve (left tail is
longer, left skewed, the median and the mode are higher than the mean). A
normal distribution has a
skew
of 0. Skewness can
also be estimated with the following function:
Kurtosis
.
The current view of kurtosis argues
that it measures the peak of a distribution.
According to
Peter Westfall, that view is not quite correct.
His article "Kurtosis as
Peakedness, 1905--2014.
R.I.P."
(
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321753/
)
makes a compelling case against the current perception. In Westfall"s view, the
peak, or lack-thereof, is a symptom rather than a characteristic that shows the
presence of outliers. High kurtosis may suggest the presence of outliers.
Technically speaking, kurtosis focuses more on the tails for the distribution
than the peak, so positive kurtosis indicates too few cases in the tails or a
tall distribution (leptokurtic), negative kurtosis too many cases in the tails
or a flat distribution (
platykurtic
). A normal
distribution has a kurtosis of 0 (given a correction of -3, otherwise it will
have a kurtosis of 3). The excel function for kurtosis is:
explore the data by groups you can sort the columns for the variables you want
(for example gender, or major or country, etc.) and obtain descriptive statistics
by selecting only the range of values that cover particular group. You can also
use pivot tables.
Let"s
say you are interested on looking at the average SAT score by gender and
student"s major. Let"s make the following
crosstabulation
the right side of the wizard layout you can see the list of all variables in
the data. Click and drag "Gender" into the "ROW" area. Click and drag "Major"
into the "COLUMN" area, and click and drag "Sat score" into the "DATA" area.
The wizard layout should look like this:
is a
crosstabulation
between gender and major. Each cell
represents the average SAT score for a student according to gender and major.
For example a female student with an econ major has an average SAT score of
1952 (cell B5 in the picture) while a male student also with an econ major has
1743 (B6). Overall econ major students have an average SAT score of 1806 (B7) .
In general, female students have an average SAT score in this sample of 1871.8
(E5) while male students 1826 (E6).
more information on pivot tables go to the following site.
http://www.microsoft.com/dynamics/using/excel_pivot_tables_collins.mspx
Let"s
say you want to explore whether there is a relationship between the average
score (grade) of each student and his/her major. In the sample we have three
majors: Econ, Math and Politics. The grades are the final grades for the entire
academic year.
do this we use one-way ANOVA, which stands for "analysis of variance". ANOVA
"is a broad class of techniques for identifying and measuring the various
sources of variation within a collection of data" (
Kachigan
,
p. 273, 1986). It is closely related to regression analysis but with the
following difference: "[w]e can think of the analysis of variance technique as
testing hypotheses about the
presence
of relationships between predictor
and criterion variables, regression analysis as describing the
nature
of
those relationships, and r
2
as measuring the
strength
of the
relationships" (
ibidem
.) In other words, ANOVA
"tests whether the means of
y
[grades
in this example] differ across categories of
x
[majors]" (Hamilton, p. 149)
the above in mind, let"s see if there is a relationship between student"s
majors and student"s final grades. First we need to rearrange the data so excel
can run the ANOVA. Using only the columns "major" and "average score (grade)".
Copy and paste both columns into a new sheet, sort by major (Data--Sort,
select the column for major and sort ascending) separate by group. Final table
should look like this..
looks
similar
to the one we got when we obtained
"descriptive statistics". Select the input range, check "labels in First Row",
and select as output range "D1", click OK. You"ll get the following:
now you should be familiar with the summary statistics presented in the first
table. You may notice that the "sum" column has decimals while the data seems
to be integers. The sum has decimals because some of the scores have decimals;
they are just rounded to the nearest integer.
Sources
of variation
. The
analysis of variance requires the estimation of two variances: between groups
(econ, math and politics) and the within groups (students).
SS
. Sum of square deviations
Degrees of freedom. For between groups is 2 (number of majors minus 1) and for
within groups is 27 (number of students minus number of majors).
MS
. Mean square of deviations (variance
estimates), which is equal to SS/
df
, Roughly 411/2
and 2549/27.
F
. Is a probability distribution.
It
is the ratio of two variances.
Roughly 205/94=2.18. According to
Kachigan
, the F is
the ratio of:
P-value
. This is the value that answers your
question. We wanted to know whether there is some sort of relationship between
majors and grades. ANOVA assumes by default that there is no relationship. As a
general rule, a p-value greater than 0.05 means ANOVA"s assumption may be
right. We got a p-value of 0.13 which is greater than 0.05, so it seems there
is no relation between a student"s major and his/her final grade. Had the
p-value been lower than 0.05 then we would have found some kind of relationship
between majors and grades.
F-crit
. It is the critical value to check
whether we reject of fail to reject ANOVA"s assumption. Check the table for
0.05 confidence at
http://www.statsoft.com/textbook/sttable.html#f05
open Stata go to Start -- Programs -- Stata[ver.*] -- Stata[*]. For cluster
computers contact OIT for instructions. When you open Stata this is what you
will see:
When you work
with Stata there are three basic procedures you may want to do first:
create
a log file
,
set your working directory
, and
set the correct
memory allocation for your data)
log file records everything you type and get while working in Stata. Commands
and output are send to a text file for you to review later. Think of it as a
"tape recorder" for your Stata session. To
create
a log file go
to File -- Log -- Begin
The second
thing to check is your working directory. To do this in the command window type
the following
Which stands
for "print working directory". This will show you your working directory, which
right now, in this example is H:\statadata.
The third
initial step is to set the necessary memory allocation. In the picture above
you can see in green letters after "Notes:" that the memory allocation is 10
mb
. This will be enough for a medium size database but
sometimes you may need more memory space to store your dataset. To determine
the size of your dataset follow the formula:
Size (in
bytes) = (8*Number of cases or rows*(Number of variables + 8))
Depending on your
Stata version and computer power, you can allocate up to around 2 gigabytes. To
allocate 1 g you can type:
NOTE
: Not recommended for really big
datasets or datasets with long string variables and lots of special characters
(like ";","
,"
,"#","%", etc.)
to Stata, click on the "Data editor" icon
Numbers
are always black. Red indicates error, in the editor"s case indicates that
values are not numbers, in this case letters or string characters. Close the
data editor by clicking on the "X" in the upper right corner
Stata
automatically eliminates the space in your original titles but keep the format
in the "Label" column. "Type" refers to whether the data is number or string (
str
*). "Format" shows the length of the variable. In the
command window type
help format
for
details.
start
exploring the data you may want to know how many
graduates and undergraduates are in the sample. For this type in the command
window (type
help tab
for more details):
Click here to
get the data
Crosstabulation
shows you the subgroups formed by two
variables. You can see that in the sample there are 10 econ majors 3 of which
are females and 7 males. You could also say that there are 15 males 7 of which
are econ major, 2 math and 6 politics.
If you want
percentages by major instead of counts type:
command window type
describe
.
The
describe
command will provide you info for the
active dataset and the format of the variables ("display format"). [Hit enter
or spacebar to see the rest of the list]. Type
help describe
for further details (if the "--more--
" message bugs you, type
set
more off
)
you compare these results with the excel file you will see they are basically
the same with the exception of Skewness and Kurtosis which Stata calculates
differently.
Tabstat
is another command that provide
summary statistics
the command line type. To
fastrack
type
tabstat
and then click on each variable in the variables
windowo
. The "s" before the parenthesis stands for
"statistics" here you select the statistics you need.
tabstat
age sat
averagescoregrade
heightin
newspaperreadershiptimeswk
, s(mean
semean
median
sd
var
skew k count
sum range min max )
example, for the cross between females and Econ. A female student with an econ
major has an average SAT score of 1952, with a standard deviation of 312 and in
the sample there are only three students in this category.
Without
the option sum(sat), we will get a simple
crosstabulation
between gender and major
By
the way, remember the little warm-up we had in excel
converting last and first names into proper format? Well, we can do that in
Stata as well. The following introduces a way to generate new variables (type
help generate
for more details)
will produce a rolling standard deviation every three years as indicated in the
option
window(
)
below, adjust it to your desired window:
cd
H:\
use
http://dss.princeton.edu/training/Panel101.dta
,
clear
xtset
country year
rolling
x1_sd
=r(
sd
),
window(3) saving(
x1_sd
): sum
x1
use
x1_sd
rename
end year
save, replace
use
http://dss.princeton.edu/training/Panel101.dta
,
clear
merge 1:1
country year
using
x1_sd
drop
_merge
more details type
help
rolling
One-way
ANOVA tests whether the mean of the dependent variable (
y
) is statistically significant among different categories of the
independent variable (
x
). The format
oneway
[measurement] [categorical]
the example below, we are interested on testing whether a student"s major has
some effect on his/her grade. Type:
Comparing
these with the results using excel (shown below) they are pretty close. "
Prob
>F" is the p-value which has to be lower than 0.05
(for 95% confidence) to be significant. Conclusions are basically the same.
you go to the menu and click "Graphics" you will see all the graphing options available
in Stata. If you do not have it already,
click
here
to get the data to do these graphs.
Let"s
see one
basic
scatterplot. We will add some options later.
Scatterplots are good to explore possible relationships between variables and
to identify outliers. In this case we want to explore visually whether there is
some relationship between age and SAT scores. If there is some kind of
relationship we would be able to see a specific patter (linear, curve, concave,
etc.). For many more bells and whistles type
help scatter
in the command window. The format is
twoway
scatter y x
.
For starters let"s type:
local command is used for macros an assigns strings names to macros. In
this case we create temporary variables. To
make the graph with the quadrants type:
you want to set your own parameters for the quadrants just type the number in
the "
yline
" and "
xline
"
options. For example we want lines that cross age at 30 and SAT at 1800:
you want to include the confidence bands we have to reverse the order of the
graphs because the shaded area tends to cover the dots. So we graph the
confidence region first, then the scatter.
[
IMPORTANT
: If you get the message "(0
real changes made)" make sure you spell the names correctly, Stata is
case-sensitive.]
change the desired positions use the option
mlabv
()
Histograms
are another good way to visually explore data, especially to check for a normal
distribution; here are some examples (type
help histogram
in the command window for further details):
graph
hbar
(mean) age
averagescoregrade
newspaperreadershiptimeswk
, over(gender) over(
studentstatus
,
label(
labsize
(small)))
blabel
(bar)
title("Student indicators")
graph
hbar
(mean) age
averagescoregrade
newspaperreadershiptimeswk
, over(gender) over(
studentstatus
, label(
labsize
(small)))
blabel
(bar) title(Student indicators) legend(label(1
"Age") label(2 "Score") label(3 "
Newsp
read"))
graph categorical data in Stata you will need a special program called
catplot
. If your version of Stata does not
have it, you can install it by typing
Notice you may have to create the variable "
agegroup
" which is a recode of "age" where 1 "18 to 21" 2
"22 to 39".
labels on the graph correspond to the number of students on each group. For
example there are 6 students econ majors with ages between 18 to 21.
you are interested on the percentages within "
agregroup
"
you can specify this as follows:
percent()
option indicates the reference group
displayed in the graph. The labels on the previous graph correspond to the
second row in the crosstab between
agegroup
and
major.
you are interested on the percentages within "major" you can specify this as
follows:
Kachigan
, Sam,
Statistical analysis: an interdisciplinary introduction to univariate
& multivariate methods
.
New
York
: Radius Press, c1986.
Textbook Examples.
Regression with Graphics. by
Lawrence
Hamilton
http://www.ats.ucla.edu/stat/examples/rwg/
Stata Library. Graph
Examples (may not work with STATA 10)
http://www.ats.ucla.edu/STAT/stata/library/GraphExamples/default.htm
help-sheets for STATA