All Things Data Science

This blog gives you an insight into the basics of Data Science using a dataset from the NYC MTA bus service.

Ever wondered how Google Maps quickly identifies the fastest route to work, or how it determines the arrival time so that you know exactly when to leave?

Google collects live traffic information from Android phone users traveling on a given route. The data is then used to determine the routes with the maximum number of vehicles. To estimate the arrival time of a user, the average speed of travel for the route is taken into consideration. This may vary depending on the traffic condition at different times of the day.

According to Josh Wills, a statistician and a former Director of Data Engineering at Slack, “Data Scientist is a person who is better at statistics than any programmer and better at programming than any statistician.”

The field of statistics influences all domains of life, such as the stock market, life sciences, weather, retail, insurance, and education, to name but a few.

Dataset link:

The data is collected from the GPS locations of the MTA buses in NYC to monitor live traffic updates. Each record in the dataset includes the direction of the bus, its location, distance from the stop, line name, etc., at regular intervals of 10 minutes.

Before we dive into the statistics of our data, let’s look at the four scales of data measurement:

  • Nominal : A nominal scale describes a variable that has no numerical significance, natural order, or ranking. Computing the mean or median for it would be meaningless. Examples include the names of students in a class and serial numbers in a table.
  • Ordinal : An ordinal scale is one where the order of the values matters, but the difference between them doesn’t. An example of ordinal data is the choices in a customer satisfaction survey, where the difference between two choices is not necessarily meaningful.
  • Interval : An interval scale is one where both order and differences are meaningful, but there is no absolute zero. For example, the difference between 10°C and 20°C is the same as the difference between 40°C and 50°C, yet 20°C is not “twice as hot” as 10°C, because 0°C does not mean “no temperature”. Thus, addition and subtraction are meaningful on this scale, but multiplication and division (ratios) are not.
  • Ratio : A ratio variable has all the properties of an interval variable along with a true zero value. Examples include the amount of money spent, the distance traveled by a car, and the area of a yard.

Based on this, let us classify the attributes of our dataset:

  • Nominal : PublishedLineName, OriginName, DestinationName, VehicleRef, NextStopPointName, ArrivalProximtyText, DirectionRef (a binary category, 0 or 1)
  • Interval : RecordedAtTime, OriginLat, OriginLong, DestinationLat, DestinationLong, VehicleLocation.Latitude, VehicleLocation.Longitude, ExpectedArrivalTime, ScheduledArrivalTime
  • Ratio : DistanceFromStop

Measures of central tendency

Measures of central tendency are something we have all been hearing about since childhood. These are summary statistics that indicate where the center of your data lies. They represent the point around which the values in a distribution tend to cluster. Choosing the best measure of central tendency depends on the type of data you have.

In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of these measures calculates the location of the central point using a different method.

  • Mean : It is the arithmetic average. The calculation of mean incorporates all values in the data. However, mean doesn’t always locate the center of the data accurately. In a skewed distribution, the mean can miss the mark. As the distribution becomes more skewed, the mean is drawn further away from the center. It is best to use mean when the distribution is symmetric.

Mean of every numeric column (interval and ratio) present in the dataset

Here, we try to find the central location based on the locations of all the bus stops.

We can’t average latitude and longitude values directly because the Earth is not flat. A change of 1° in longitude corresponds to a distance of about 111.321 km near the Equator but approximately 0 km near the poles. Hence, we used the WGS84 Earth model to convert the coordinates to a Cartesian coordinate system, computed the central point there, and then converted the result back to latitude and longitude.
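The idea can be sketched in a few lines of Python. This is a simplified illustration and not the code behind the figures reported in this post: it treats the Earth as a unit sphere instead of the WGS84 ellipsoid, and the function name geographic_center and the sample stops are hypothetical.

```python
import math

def geographic_center(points):
    """Average (lat, lon) pairs in degrees via 3D Cartesian unit vectors.

    Assumption: the Earth is modelled as a unit sphere, which is a
    simplification of the WGS84 ellipsoid mentioned in the text.
    """
    x = y = z = 0.0
    for lat_deg, lon_deg in points:
        lat = math.radians(lat_deg)
        lon = math.radians(lon_deg)
        # Convert each point to a unit vector on the sphere.
        x += math.cos(lat) * math.cos(lon)
        y += math.cos(lat) * math.sin(lon)
        z += math.sin(lat)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    # Convert the averaged vector back to latitude and longitude.
    lon = math.atan2(y, x)
    lat = math.atan2(z, math.hypot(x, y))
    return math.degrees(lat), math.degrees(lon)

# Hypothetical bus stop coordinates (not taken from the dataset).
stops = [(40.70, -73.95), (40.75, -73.90), (40.68, -73.99)]
center_lat, center_lon = geographic_center(stops)
print(center_lat, center_lon)
```

For points as close together as bus stops within one city, the result is almost identical to a plain average of the coordinates. The Cartesian route matters more when the points are spread over a large area or straddle the ±180° meridian.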

The location comes out to be 40.717734964493374 N, 73.9384030561444 W, which is close to East Williamsburg, Brooklyn, NY. A little bit of googling tells me that Williamsburg is one of the best neighborhoods in Brooklyn to live in, with excellent restaurants, bars, and shopping. Imagine how much time otherwise spent waiting for buses could be saved by moving to this area, with plenty of bus stops close by.

  • Median : Median is the middle value. It splits the dataset exactly in half. Unlike the mean, median value doesn’t depend on all the values in the dataset. Hence, when some of the values are more extreme, the effect on the median is smaller. In the case of skewed distribution, median is a better choice than mean.

Median of every numeric column (interval and ratio) present in the dataset

  • Mode : Mode is the value that occurs the most frequently in the dataset. A multimodal distribution is the one with multiple modes. When you have a categorical dataset it is best to go with mode as the measure of central tendency.
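To see how the three measures react to an extreme value, here is a small sketch using Python’s built-in statistics module. The distances are made-up numbers chosen for illustration, not values from the MTA dataset.

```python
import statistics

# Hypothetical distances from the next stop, in metres.
# The last value is a deliberate outlier that skews the data to the right.
distances = [20, 25, 25, 30, 35, 40, 900]

mean_value = statistics.mean(distances)      # uses every value, so the outlier pulls it up
median_value = statistics.median(distances)  # middle value of the sorted data
mode_value = statistics.mode(distances)      # most frequent value

print(f"mean={mean_value:.1f}, median={median_value}, mode={mode_value}")
```

The single outlier drags the mean far above the typical distance, while the median and mode stay with the bulk of the data. This is why the median is usually preferred for skewed columns.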

Mode of every column present in the dataset (Nominal, interval and ratio)

Most frequent origin and destination bus stops for buses moving towards the city (DirectionRef 1) and away from the city (DirectionRef 0)

Most frequent paths taken by the buses from the origin bus stops found above

Most frequent paths taken by the buses to the destination bus stops found above

Most frequent path with DirectionRef = 0

Most frequent path with DirectionRef = 1

We find that no bus runs directly from the most frequent origin stops to the most frequent destination stops.

Data Distribution

We explore the data distribution using visualizations such as histograms, box plots, kernel density estimation (kdeplot), and Quantile-Quantile plots (qqplot).

A Q-Q plot helps us determine whether the data under study is normally distributed. For normally distributed data, the points should form a roughly straight line.
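As a rough sketch of what a Q-Q plot computes, the snippet below pairs each sorted observation with the matching quantile of a standard normal distribution, then measures how straight the resulting line is. It uses only the standard library. The helper names qq_points and straightness, the plotting positions (i + 0.5) / n, and the simulated samples are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def straightness(points):
    """Pearson correlation of the Q-Q points; values near 1 mean a straight line."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = straightness(qq_points(normal_sample))
r_skewed = straightness(qq_points(skewed_sample))
print(r_normal, r_skewed)
```

The normal sample should give a value very close to 1, while the skewed sample bends away from the line and scores noticeably lower.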

A box plot is a type of chart used in data analysis to show the distribution and skewness of numerical data by displaying its quartiles and median. If the two halves of the blue box (on either side of the median) are equal and the whiskers are of similar length, the data is roughly symmetric, which is consistent with, though not proof of, a normal distribution.
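The numbers a box plot draws can be computed directly. The sketch below uses statistics.quantiles from the standard library. The helper name box_plot_summary and both samples are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles and the widths of the two halves of the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "lower_half": median - q1,  # width of the box below the median
        "upper_half": q3 - median,  # width of the box above the median
    }

# Hypothetical samples: one evenly spaced, one skewed to the right.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
skewed = [1, 2, 2, 3, 3, 4, 6, 10, 20]

print(box_plot_summary(symmetric))
print(box_plot_summary(skewed))
```

In the skewed sample the upper half of the box is wider than the lower half, which is how skewness shows up in the plot. The evenly spaced sample has equal halves even though it is not bell-shaped, so the box alone cannot confirm normality.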

  • Plots for Vehicle Location Longitude column

The data in the VehicleLocation.Longitude column, along with a few other columns, approximately follows a normal distribution. There is a slight deviation from the ideal distribution owing to the presence of certain outliers and the fact that real-world data is never perfect. In addition, we observed that the measures of central tendency are nearly equal, which is usually the case when the data follows a normal distribution.

  • Plots for Distance From Stop Column

The above plots suggest that DistanceFromStop roughly follows an exponential distribution: most buses are recorded close to their next stop, and the number of records falls off rapidly as the distance increases.

Hypothesis Testing

One of the main purposes of statistics is to test a hypothesis. For example, you might run an experiment and find that a certain drug is effective at treating headaches. But if the result does not hold up when the same test is run on a larger population, the drug cannot be trusted, and the earlier result was probably due to chance.

Similarly, we make an assumption about our data and use an appropriate hypothesis test to check whether the results obtained are likely to hold for the larger population.

The null hypothesis, H0, is generally taken as the hypothesis that negates the relationship, for example, that BMI does not vary with height. The alternate hypothesis, H1, generally asserts the relationship: BMI varies with height.

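To check whether the result of the test is statistically significant, we use the p-value obtained from the hypothesis test: the probability of seeing a result at least as extreme as the observed one if H0 were true. It is compared against a threshold (the significance level), usually 5% or 1%.

If p < threshold, H0 is rejected; otherwise we fail to reject H0 (which is not the same as proving H0 true).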
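The decision rule can be sketched as follows. This is not the test run in this post. It computes a Welch-style statistic for two samples and, to stay within the standard library, approximates the two-sided p-value with the normal distribution. That approximation is only reasonable for large samples, and a dedicated statistics library would use the t distribution instead. The function name two_sided_test and the simulated groups are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance

def two_sided_test(sample_a, sample_b, alpha=0.05):
    """Compare two sample means; return (statistic, p_value, reject_h0).

    Assumption: samples are large enough for the normal approximation
    to the t distribution to be acceptable.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = (variance(sample_a) / n_a + variance(sample_b) / n_b) ** 0.5
    statistic = (mean(sample_a) - mean(sample_b)) / standard_error
    # Two-sided: probability mass in both tails beyond |statistic|.
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value, p_value < alpha

rng = random.Random(1)  # fixed seed so the run is repeatable
group_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
group_b = [rng.gauss(1.0, 1.0) for _ in range(200)]  # true mean shifted by 1

statistic, p_value, reject_h0 = two_sided_test(group_a, group_b)
print(statistic, p_value, reject_h0)
```

Here H0 is “the two groups have the same mean”. Because the second group was generated with a shifted mean, the p-value falls far below 5% and H0 is rejected.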

Here, we test the relationship between the origin and the bus location using a two-tailed t-test, as both attributes are approximately normally distributed.

On plotting a histogram of the above distance it is found that a significant number of buses are located close to the Origin.

The output obtained from the T-test :

The above two results suggest that there is a relationship between the origins and the bus locations that is unlikely to be a mere coincidence.

Relationship Analysis

Till now we have only established the fact that a relationship exists between the origin and bus location. To determine the direction and strength of this relationship we make use of correlation and covariance.

The correlation value always lies between -1 and 1. A magnitude of 1 indicates a perfect linear relationship, and 0 indicates no linear relationship. (Independent variables have zero correlation, but zero correlation alone does not guarantee independence.)

A negative correlation or covariance value signifies a negative relationship, i.e., when one variable increases, the other decreases, and vice versa. A positive value signifies a positive relationship, i.e., when one variable increases, the other increases with it, and likewise when it decreases.

For the relationships we are trying to analyse,

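Although the covariance value came out very small, the correlation value is high, which shows that a strong relationship exists between the two. The small covariance is most likely a matter of scale: covariance depends on the units and spread of the variables, and the coordinate values here vary only within a narrow range, whereas correlation is normalized by the standard deviations.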
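The difference between the two measures is easy to see in code. The sketch below computes sample covariance and Pearson correlation from their definitions. The latitude values are made up for illustration and are not taken from the dataset.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

# Hypothetical origin and vehicle latitudes, spread over a fraction of a degree.
origin_lat = [40.60, 40.65, 40.70, 40.75, 40.80]
vehicle_lat = [40.61, 40.66, 40.69, 40.76, 40.79]

cov_small = covariance(origin_lat, vehicle_lat)
r = correlation(origin_lat, vehicle_lat)

# Rescaling both columns changes the covariance but not the correlation.
scaled_origin = [v * 1000 for v in origin_lat]
scaled_vehicle = [v * 1000 for v in vehicle_lat]
cov_scaled = covariance(scaled_origin, scaled_vehicle)
r_scaled = correlation(scaled_origin, scaled_vehicle)

print(cov_small, r)
print(cov_scaled, r_scaled)
```

The covariance of the raw values is tiny because the values barely vary, yet the correlation is close to 1. Multiplying both columns by 1000 inflates the covariance a million-fold and leaves the correlation unchanged.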

An important point to be noted about correlation is it does not imply causation. Sometimes factors that are correlated may not be connected by a cause and effect relationship. It is one thing to say that there is a correlation between rain and the monthly sale of umbrellas, but an entirely different thing to say that rain caused the sales. Unless you’re selling umbrellas, it might be difficult to prove that there is cause and effect. Thus, from the results obtained above no conclusion can be drawn about the dependence of one variable on another.

This takes us to another important step in data analysis, which is Regression analysis.

Regression Analysis

Through regression analysis we try to determine the impact of the origin location of the bus on its current location by fitting a straight line to the data. The analysis yields an equation of the form

VehicleLocation = m * OriginLocation + C + error

where OriginLocation is the independent variable, VehicleLocation is the dependent variable, m is the slope, and C is the intercept.

The error term tells you how certain you can be about the formula. The larger it is, the less certain the regression line.

The R-squared (R2) score evaluates the scatter of the data points around the fitted regression line. Higher R2 values represent smaller differences between the observed data and the fitted values. Hence, the larger the R2, the better the regression model fits your observations.
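A minimal sketch of how the slope, intercept, and R2 can be obtained with ordinary least squares is shown below. The function names fit_line and r_squared and the data points are hypothetical, and this is not the code used to produce the equations in this post.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    my = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data that lies close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(f"Y = {slope:.3f} X + {intercept:.3f}, R2 = {r2:.4f}")
```

The residuals (observed minus fitted values) play the role of the error term, and R2 approaches 1 as they shrink relative to the overall spread of the dependent variable.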

Equation obtained from the Origin vs Bus latitude curve :

Y = 0.842 X + 6.404

Equation obtained from the Origin vs Bus longitude curve :

Y = 0.803 X - 14.563

From the linear regression analysis done on the data, we found that the goodness-of-fit measure (R2 score) is quite high and the error measurements are low, which supports the conclusion that there is a relationship between the origins and the vehicle locations.

Github link for the code:

A bunch of enthusiastic data scientists in the making