Data Description:
Dataset 1: Trip data for February 2013 (the source provides 12 monthly datasets,
January to December 2013)
Attributes description:
Medallion: a permit to operate a yellow taxi cab in New York City; it is effectively a (randomly assigned) car ID.
Hack license: a license to drive the vehicle; it is effectively a (randomly assigned) driver ID.
Vendor id: e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT).
Dropoff longitude and dropoff latitude: GPS coordinates at the end of the trip.
Dataset 2: Fare data for February 2013 (the source provides 12 monthly datasets,
January to December 2013)
Attributes description:
Medallion: a permit to operate a yellow taxi cab in New York City; it is effectively a (randomly assigned) car ID.
Hack license: a license to drive the vehicle; it is effectively a (randomly assigned) driver ID.
Vendor id: e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT).
Pickup datetime: start time of the trip, mm-dd-yyyy hh24:mm:ss EDT.
Fare amount: the meter fare, in USD; it should include the Newark surcharge.
Surcharge: Extra fees, such as rush hour and overnight surcharges, in USD.
Tolls amount: total price paid for tolls, summed across all tolls for the trip, in USD.
Total amount: all charges that are presented to the passenger at time of fare payment (includes tip for non-cash trips), in USD.
Final dataset:
3590754 rows with 21 attributes (first week of February).
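The report does not say how the trip and fare files were combined into the final dataset. A minimal sketch in Python, assuming the two files are inner-joined on the attributes they share (medallion, hack license, vendor id, pickup datetime); the column names and sample rows are hypothetical, and this is not the author's own code:

```python
import csv
import io

# Assumed join key; these column names are hypothetical, not taken from the report.
KEY = ("medallion", "hack_license", "vendor_id", "pickup_datetime")

def join_trip_and_fare(trip_rows, fare_rows, key=KEY):
    """Inner-join trip and fare records on the shared key columns."""
    fare_by_key = {tuple(row[k] for k in key): row for row in fare_rows}
    joined = []
    for trip in trip_rows:
        fare = fare_by_key.get(tuple(trip[k] for k in key))
        if fare is not None:
            merged_row = dict(trip)
            merged_row.update(fare)  # key columns carry identical values
            joined.append(merged_row)
    return joined

# Tiny made-up sample, for illustration only
trip_csv = """medallion,hack_license,vendor_id,pickup_datetime,trip_distance
A1,H1,CMT,02-01-2013 08:00:00,2.5
A2,H2,VTS,02-01-2013 09:00:00,1.0
"""
fare_csv = """medallion,hack_license,vendor_id,pickup_datetime,fare_amount
A1,H1,CMT,02-01-2013 08:00:00,10.5
"""
trips = list(csv.DictReader(io.StringIO(trip_csv)))
fares = list(csv.DictReader(io.StringIO(fare_csv)))
merged = join_trip_and_fare(trips, fares)
print(len(merged), sorted(merged[0]))
```

Only the trip with a matching fare record survives the join, and the merged row carries the columns of both files.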
Single-variable analysis: quantile plots for a few of the attributes.
Total medallions: 13306
The above graph shows that most medallions were used relatively little, while some were used more than 500 times in a
week (~71 times a day).
X axis: Each medallion.
Y axis: medallion frequency.
Total number of hacks: 29836. Plotting the number of trips made by each hack for the top 5000 and bottom 5000 gives
the above graph. It shows that
most of the hacks (drivers) were under-utilized, whereas a few hacks drove over 400 trips in the week (about 57
trips per day), and some averaged just over 7 trips a day.
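The per-medallion and per-hack frequencies, and the per-day figures quoted above, can be reproduced with a simple count. This is a minimal Python sketch with made-up IDs, not the author's code; the 7-day divisor is an assumption based on the one-week sample:

```python
from collections import Counter

DAYS_IN_SAMPLE = 7  # assumption: the report analyses one week of data

def trips_per_id(ids):
    """Count how many trips each medallion (or hack license) appears in."""
    return Counter(ids)

def average_per_day(weekly_trips, days=DAYS_IN_SAMPLE):
    """Convert a weekly trip count into an average daily count."""
    return weekly_trips / days

# Made-up IDs, one entry per trip, for illustration only
sample_medallions = ["M1", "M2", "M1", "M3", "M1"]
counts = trips_per_id(sample_medallions)
busiest_id, busiest_trips = counts.most_common(1)[0]

print(busiest_id, busiest_trips)
print(round(average_per_day(500)))  # 500 trips a week is about 71 a day
print(round(average_per_day(400)))  # 400 trips a week is about 57 a day
```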
About 70% of cab rides have a single passenger; rides with zero passengers were also reported.
14% of cab rides have 2 passengers.
4% of cab rides have 3 passengers.
2% of cab rides have 4 passengers.
7% of cab rides have 5 passengers.
3% of cab rides have 6 passengers.
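A percentage breakdown like the one above is a frequency count divided by the number of rides. A minimal Python sketch, using a made-up list of passenger counts rather than the real data:

```python
from collections import Counter

def percentage_distribution(values):
    """Return the share (in percent) of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in sorted(counts.items())}

# Made-up passenger counts, one entry per ride
sample_passengers = [1, 1, 1, 2, 5, 1, 1, 2, 1, 1]
distribution = percentage_distribution(sample_passengers)
for passengers, share in distribution.items():
    print(f"{passengers} passenger(s): {share:.0f}%")
```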
I'm not sure how trip distance is calculated, but almost 90% of cab rides are 6 miles or less.
Only 8% of cab rides covered between 6 and 11 miles.
2% of cab rides covered between 11 and 18 miles.
The left graph shows an outlier at 480 dollars; removing it with the condition fare < 100 dollars
gives the right-side plot.
From the right-side graph, the median cab fare is 10 dollars.
93% of cab rides cost less than ~25 dollars.
4% cost between ~25 and ~40 dollars.
3% cost between ~40 and ~56 dollars.
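The outlier filter and the summary figures can be sketched as follows. This is an illustrative Python version with made-up fares; only the 100-dollar cutoff and the 25-dollar threshold come from the text above:

```python
import statistics

def summarize_fares(fares, cutoff=100.0, threshold=25.0):
    """Drop fares at or above the cutoff, then summarize what remains."""
    kept = [fare for fare in fares if fare < cutoff]
    under = sum(1 for fare in kept if fare < threshold)
    return {
        "dropped": len(fares) - len(kept),
        "median": statistics.median(kept),
        "share_under_threshold": 100.0 * under / len(kept),
    }

# Made-up fares in USD, including one extreme value
sample_fares = [5.0, 8.0, 10.0, 12.0, 30.0, 480.0]
summary = summarize_fares(sample_fares)
print(summary)
```

The same function applies to the tip amount by passing cutoff=25.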
Removing the outlier from the left graph with the condition tip amount < 25 gives the right-side graph.
Payment type frequency table (levels: 5; head of freqTable):
CRD : 1972931
CSH : 1606335
NOC : 7494
DIS : 2339
This gives us some initial insight: credit card (CRD, about 55% of rides) and cash (CSH, about 45%) account for nearly all payments.
Tip amounts are zero for every payment type other than CRD and UNK.
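The payment shares follow directly from the frequency table and the total row count reported earlier. A short Python check of that arithmetic (the UNK count is not listed in the report, so it is left out here):

```python
# Counts as listed in the frequency table; UNK is not listed and is omitted.
payment_counts = {"CRD": 1972931, "CSH": 1606335, "NOC": 7494, "DIS": 2339}
TOTAL_ROWS = 3590754  # size of the final dataset

payment_shares = {
    code: 100.0 * count / TOTAL_ROWS for code, count in payment_counts.items()
}
for code, share in payment_shares.items():
    print(f"{code}: {share:.1f}%")
```

Credit card comes out a little under 55% and cash a little under 45%, so the two are close but not equal.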
Each point in the graph is a pickup location, plotted from the pickup latitude and longitude.
Pickup locations are used to map the city streets and analyze traffic mobility within them. This is how the latitude and
longitude columns are used in my analysis.
The graph represents the relationship between the total amount paid and the distance covered.
The green area in the graph marks the highest count (the most frequent combinations of amount paid and distance
travelled); short-distance trips are the most common.
Similarly, each color indicates the frequency of a combination of amount paid and distance travelled.
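A colored frequency plot of this kind rests on counting how many trips fall into each (distance, amount) cell. The report does not state the binning used, so the sketch below assumes simple rectangular bins with hypothetical widths of 1 mile and 5 dollars, on made-up points:

```python
import math
from collections import Counter

def bin_2d(points, x_width=1.0, y_width=5.0):
    """Count points per rectangular cell; bin widths are assumptions."""
    cells = Counter()
    for x, y in points:
        cells[(math.floor(x / x_width), math.floor(y / y_width))] += 1
    return cells

# Made-up (distance in miles, total amount in USD) pairs
sample_points = [(0.5, 4.0), (0.8, 3.0), (2.5, 12.0)]
cells = bin_2d(sample_points)
densest_cell, densest_count = cells.most_common(1)[0]
print(densest_cell, densest_count)
```

The densest cell corresponds to the area a plot would draw in its highest-frequency color.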
The graph shows the number of trips made on each day of the week; every point represents the total trips, in lakhs
(1 lakh = 100,000), made on that day.
Friday and Saturday record the highest number of trips, whereas Sunday and Monday record the lowest;
the rest of the days lie in between.
The graph represents the number of trips made in every hour on each day of the week, showing how
trip counts vary by day and by time of day.
From Monday to Friday there is a steep increase in the number of trips from 5 AM to 8 AM,
then a gradual decrease up to 10 AM, after which trip numbers keep fluctuating up to
3 PM.
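Both the per-day and per-hour counts can be derived from the pickup datetime. A minimal Python sketch, assuming the timestamp string follows the mm-dd-yyyy hh24:mm:ss layout given in the attribute description and carries no time zone suffix; the sample values are made up:

```python
from collections import Counter
from datetime import datetime

PICKUP_FORMAT = "%m-%d-%Y %H:%M:%S"  # assumed layout, no time zone suffix
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def trips_by_day_and_hour(pickups, fmt=PICKUP_FORMAT):
    """Count trips per (day of week, hour of day)."""
    counts = Counter()
    for text in pickups:
        moment = datetime.strptime(text, fmt)
        counts[(DAY_NAMES[moment.weekday()], moment.hour)] += 1
    return counts

def trips_by_day(day_hour_counts):
    """Collapse the hourly counts into one total per day of week."""
    totals = Counter()
    for (day, _hour), n in day_hour_counts.items():
        totals[day] += n
    return totals

# Made-up pickup times; 02-01-2013 was a Friday, 02-03-2013 a Sunday
sample_pickups = [
    "02-01-2013 07:15:00",
    "02-01-2013 07:45:10",
    "02-01-2013 18:05:00",
    "02-03-2013 07:30:00",
]
by_day_hour = trips_by_day_and_hour(sample_pickups)
by_day = trips_by_day(by_day_hour)
print(by_day_hour[("Friday", 7)], by_day["Friday"], by_day["Sunday"])
```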
The graph shows the number of trips made on each day of the week during the morning peak hours, where each point
represents the number of trips in that hour (minutes 0 to 59). The time period is 5 AM to 7:59 AM.
Saturday and Sunday mornings show far fewer trips than the rest
of the days.
The graph represents the percentage of occupied and unoccupied cabs at different times
from 12:00 AM to 11:59 AM.
Every point on the red line represents the percentage of occupied cabs, and every point on the blue line the
percentage of unoccupied cabs, in that hour (minutes 0 to 59).
On Saturday from 3 AM to 3:59 AM, the occupied percentage is ~40%, whereas the unoccupied
percentage is ~60%.
On Wednesday from 12 AM to 12:59 AM, the occupied and unoccupied percentages are equal.
The graph represents the percentage of occupied and unoccupied cabs at different times from 12:00 PM
to 11:59 PM.
Every point on the red line represents the percentage of occupied cabs, and every point on the blue line the
percentage of unoccupied cabs, in that hour (minutes 0 to 59).
On Saturday from 11:00 PM to 11:59 PM, the occupied percentage is ~79.6%, whereas the unoccupied
percentage is ~19.4%.
On Sunday from 7 PM to 7:59 PM, the occupied percentage is ~53% and the unoccupied percentage is ~47%.
X axis: distance travelled. Y axis: total fare excluding tolls.
The green area indicates the most frequent combinations of distance and fare amount
(excluding tolls).
Short-distance trips are far more frequent than any other combination.
Future work:
This analysis covers only the first week of February, and I
used local disk as the backend to produce these results. The same analysis
could be run on Google BigQuery to speed up processing.
The same code can be run on the entire month of data, given a sufficiently
powerful machine.
More interesting insights can be drawn by comparing the results from different
months.