# Scatterplots

Scatterplots are graphs that show how two variables are related. The x- and y-axes of a scatterplot both run from low to high values, and each pair of values is plotted as a point on the graph, just as you plot points on a coordinate plane.

If the points group in a pattern, the scatterplot will show that there is a relationship, or a correlation, between the two variables. If the points move upward and rightward, we say that there is a **positive correlation** (as one variable goes up, the other goes up as well). A good example of this is the relationship between age and height in children. As children get older, they tend to get taller as well. If you plotted the height of children on one axis, and the age on the other, you'd see a scatterplot like this:

If the points move downward and rightward together, we say that there is a **negative correlation** (as one variable goes up, the other goes down). A good example of this is the relationship between temperature and heating bills. As the temperature goes down, people's heating bills go up. If you plotted temperature on the x-axis and the amount of people's heating bills on the y-axis, you'd see a scatterplot like this:

If the points are randomly scattered, then there is no correlation between the two variables. An example of this would be the relationship between how many pairs of shoes people own and how often they eat peas. If you graphed the number of pairs of shoes people own on the x-axis and the number of times they eat peas each month on the y-axis, you might see a scatterplot like this:

Often, it's enough just to be able to read a scatterplot and see whether the data have a relationship or correlation. Notice that some of the scatterplots above have numbers on the axes and some don't. To discern a pattern or a correlation in a scatterplot, you often don't need numbers. It's enough to know that the data trend together (or not). If you have numbers on the axes, you can tell a little more about the data, but the trend remains the same.
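When the axes do have numbers, the direction of a trend can also be checked with a quick calculation. The sketch below computes the Pearson correlation coefficient, a standard measure that this section does not introduce by name, for three small datasets that mirror the examples above. All of the data values are made up for illustration, and the 0.3 cutoff in `describe` is an arbitrary assumption, not a rule.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: near +1 for a rising trend, near -1 for a falling one, near 0 for none."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def describe(r, cutoff=0.3):
    """Label a correlation; the cutoff is a hypothetical choice for this sketch."""
    if r > cutoff:
        return "positive correlation"
    if r < -cutoff:
        return "negative correlation"
    return "no clear correlation"

# Hypothetical data, invented for illustration only.
ages = [2, 4, 6, 8, 10]                # years
heights = [86, 102, 115, 128, 138]     # centimeters
temperatures = [0, 5, 10, 15, 20]      # degrees
heating_bills = [200, 160, 130, 90, 60]
pairs_of_shoes = [2, 4, 6, 8, 10]
peas_per_month = [3, 7, 1, 7, 3]

for name, xs, ys in [
    ("age vs. height", ages, heights),
    ("temperature vs. heating bill", temperatures, heating_bills),
    ("shoes vs. peas", pairs_of_shoes, peas_per_month),
]:
    print(name, "->", describe(pearson_r(xs, ys)))
```

The number only summarizes what the picture already shows: the sign matches the direction of the trend, and values close to zero match a randomly scattered plot.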

Other times, it's important to be able to **create a scatterplot** or **read the specific data off of a scatterplot**. For that, you need numbers (with units) on the axes.

So, how do you make a scatterplot?

You start with an empty grid.

If you were making an actual scatterplot, you would label the x- and y-axes with the names of your variables. In this particular case, we are going to leave them blank because we're going to make several different scatterplots (with different variables on the axes).

Then take a pair of values (a coordinate pair) and plot that pair as a point on the scatterplot.

Let's say that we have one person in the dataset who owns 20 pairs of shoes (x-axis) and 20 video games (y-axis).

This point would be (20, 20), and we can plot this coordinate pair the same way we plot points on a coordinate plane.

If we want to see whether there's an overall trend between the number of pairs of shoes people own and the number of video games they own, we can plot the data from several people.

If we plot more datapoints, one for each person's pairs of shoes and video games, the scatterplot looks like this. The scatterplot shows that there is no relationship or correlation between the number of pairs of shoes people own and the number of video games they own.
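The steps above (start with an empty grid, then plot each coordinate pair) can be sketched in a few lines of code. This is a minimal text-only version that avoids any plotting library. The function name `render_scatter`, the 0 to 50 axis range, the grid step of 5, and every person other than the (20, 20) example are assumptions made for illustration.

```python
def render_scatter(points, x_max=50, y_max=50, step=5):
    """Return a text scatterplot; each grid cell covers `step` units on both axes."""
    cols = x_max // step + 1
    rows = y_max // step + 1
    # Start with an empty grid of dots.
    grid = [["." for _ in range(cols)] for _ in range(rows)]
    # Plot each (x, y) pair in the nearest grid cell, skipping points off the grid.
    for x, y in points:
        if 0 <= x <= x_max and 0 <= y <= y_max:
            grid[round(y / step)][round(x / step)] = "*"
    lines = []
    # Print the highest y value first so the y-axis increases upward.
    for row in range(rows - 1, -1, -1):
        cells = "".join(f"{cell:>3}" for cell in grid[row])
        lines.append(f"{row * step:>3} |{cells}")
    lines.append("    +" + "-" * (3 * cols))
    lines.append(" " * 5 + "".join(f"{col * step:>3}" for col in range(cols)))
    return "\n".join(lines)

# (pairs of shoes, video games); only (20, 20) comes from the text, the rest are invented.
people = [(20, 20), (5, 35), (10, 10), (30, 5), (35, 40), (15, 45), (40, 25), (25, 30)]
print(render_scatter(people))
```

Each `*` is one person. With points spread across the grid like this, no upward or downward pattern stands out.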

Let's try a different dataset. What if we graph the number of hours a week people watch TV (x-axis) and the number of hours a week people spend playing video games (y-axis)?

If we plot a bunch of these datapoints, the resulting scatterplot might look more like this, and here you do see a relationship or correlation. This is a positive correlation: people who watch more hours of TV also tend to spend more hours playing video games.

So, how can we use this scatterplot to make predictions about people's TV viewing and video gaming habits?

Once you find a correlation (this scatterplot shows a positive correlation), you can **read individual plot points**. For example, the person who watches TV for 25 hours a week also spends 5 hours a week playing video games.

You can also draw a **best fit line** in order to make predictions. The best fit line shows the general trend in the data. Using the best fit line, you can predict what the data might look like for someone who is not on the graph. How many hours of video games might someone who spends 5 hours a week watching TV play? Using the best fit line, we can predict that someone who watches 5 hours of TV a week will play about 3 hours of video games a week (find where 5 on the x-axis meets the best fit line, then read across to the y-axis).
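A best fit line can be drawn by eye, but it can also be calculated. The sketch below uses least squares, one common method; the text does not say which method produced its line, so treat this as an illustration rather than a reconstruction. The data values are made up, chosen so that the prediction at 5 hours of TV lands near the 3 hours read off the graph above.

```python
def best_fit_line(xs, ys):
    """Least-squares line through the data; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical weekly hours, invented for illustration only.
tv_hours = [5, 10, 15, 20, 25]
game_hours = [3.5, 3.0, 4.0, 4.0, 5.5]

slope, intercept = best_fit_line(tv_hours, game_hours)

def predict_game_hours(tv):
    """Read the best fit line at a given number of TV hours."""
    return slope * tv + intercept

print(round(predict_game_hours(5), 2))  # prints 3.0
```

Reading the line at x = 5 is the same move as finding where 5 on the x-axis meets the line on the graph. The prediction is a point on the trend, not a guarantee about any one person.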

When you draw a best fit line, you can ignore data that are completely off the pattern. We call these points **outliers**. This scatterplot has two outliers (circled in the bottom right). Although these datapoints are real, these people don't seem to follow the general trend (they watch a lot of TV but play very few video games), so they are not very helpful for making predictions.
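To see why outliers are set aside, it can help to compare the best fit line with and without them. The sketch below fits a least-squares line (an assumed method, as the text does not name one) to made-up data twice: once with two hand-picked outliers included and once without. The outliers are chosen by eye, as in the text, not detected automatically, and all values are hypothetical.

```python
def best_fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in points)
        / sum((x - mean_x) ** 2 for x, _ in points)
    )
    return slope, mean_y - slope * mean_x

# Hypothetical (TV hours, video game hours) pairs that follow the general trend.
trend_points = [(5, 3.5), (10, 3.0), (15, 4.0), (20, 4.0), (25, 5.5)]
# Two hand-picked outliers: a lot of TV, very few video games.
outliers = [(28, 0.5), (30, 1.0)]

slope_trend, _ = best_fit_line(trend_points)
slope_all, _ = best_fit_line(trend_points + outliers)

print("slope without outliers:", round(slope_trend, 3))
print("slope with outliers:", round(slope_all, 3))
```

In this small invented dataset, including the two outliers pulls the line from rising to falling, which would hide the trend that most of the points share.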

Remember, statistics are about trends and likelihood. Statistics will never predict everyone or everything, but we can use data, statistics, and probability to make educated guesses. This scatterplot tells us that people who watch more TV will probably also play more video games (but we could be wrong).