Instructions


These are step-by-step instructions on how to match your AVL data to a GTFS schedule. (AVL stands for Automatic Vehicle Location, a more general term for GPS-based bus tracking.)

Prerequisites

You will need a Python installation, a database (PostgreSQL is assumed by the provided setup), your GTFS schedule data already loaded into that database, and a source of AVL/GPS tracking data.

1. Set up Python database access

If you're using Postgres or another socket-communicating DB server, there is a file in the config folder called db.config.example. Make a copy of this file called db.config and edit it appropriately.

If you're using anything other than Postgres, modify common/src/dbutils.py to handle your DBMS (this shouldn't be difficult).
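
If you do need to adapt the database layer, the sketch below shows the general shape of a config-driven connection helper. It is not the actual contents of common/src/dbutils.py; the config section and key names are assumptions, so match them to whatever db.config.example actually contains.

    # Illustrative sketch only -- the real common/src/dbutils.py may differ.
    # The config section and key names (database/host/dbname/user/password)
    # are assumptions; check config/db.config.example for the real ones.
    from configparser import ConfigParser

    def get_connection(config_path="config/db.config", dbms="postgres"):
        cfg = ConfigParser()
        cfg.read(config_path)
        db = dict(cfg["database"])          # assumed section name

        if dbms == "postgres":
            import psycopg2
            return psycopg2.connect(host=db.get("host", "localhost"),
                                    dbname=db["dbname"],
                                    user=db["user"],
                                    password=db.get("password", ""))
        elif dbms == "sqlite":
            # Example of adapting to a file-based DBMS instead of a socket server.
            import sqlite3
            return sqlite3.connect(db["dbname"])
        raise ValueError("unsupported DBMS: %s" % dbms)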

2. Create the schema for bus tracking data

From the core/sql folder, apply the make_tables.sql file to your database.
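
For Postgres, something like psql -f core/sql/make_tables.sql <your_db> does this from the command line. If you prefer to do it from Python, here is a rough sketch; the connection parameters are placeholders for your own db.config values.

    # Rough sketch: apply core/sql/make_tables.sql via psycopg2.
    # The connection parameters are placeholders -- use your db.config values.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="transit", user="transit")
    with open("core/sql/make_tables.sql") as f:
        schema_sql = f.read()
    cur = conn.cursor()
    cur.execute(schema_sql)   # psycopg2 accepts multiple ;-separated statements
    conn.commit()
    cur.close()
    conn.close()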

3. Pull the bus tracking data into the DB

It is up to you to figure out how to do this for your particular data source. From the Python side, the goal is to get your data points into the format expected by the update_routes() method in core/src/dbqueries.py. Otherwise, you will need to fit your data into the DB schema directly, in particular into the vehicle_track table that was created in step 2.
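
For a rough idea of what "fitting your data in" means, here is a hypothetical insert of a single AVL point. Only the table name and the dirtag/routetag columns come from this document; the other column names and the sample values are placeholders, so check the definition of vehicle_track in core/sql/make_tables.sql first.

    # Hypothetical sketch of loading one AVL point into vehicle_track.
    # Only vehicle_track, dirtag and routetag are known from this document;
    # the other column names (vehicle_id, lat, lon, reported_update_time)
    # and the sample values are placeholders -- check core/sql/make_tables.sql.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="transit", user="transit")
    cur = conn.cursor()
    cur.execute("""insert into vehicle_track
                     (vehicle_id, routetag, dirtag, lat, lon, reported_update_time)
                   values (%s, %s, %s, %s, %s, %s)""",
                ("5403", "N", "N__IB", 37.7601, -122.5083, "2009-06-01 08:15:30"))
    conn.commit()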

To grab San Francisco Muni tracking data, you can just use the code already provided in the nextmuni_import directory. This is set up to pull the data directly from the NextMuni XML feed.

To use it, follow these steps:

3.1. Fix SFMuni GTFS data

From the nextmuni_import/sql directory, apply the change_sf_gtfs.sql file.

3.2. Set up a job to collect the data

Essentially, you want the Python script found at nextmuni_import/src/route_scraper.py to run every minute (or however often you choose), giving it "ALL" as an argument:

       python route_scraper.py ALL 

To get this to run every minute, the easiest thing is to set up a cron job. You're on your own to figure this out, or to figure out any other method to have this run on a regular basis.
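
If you would rather not touch cron, a bare-bones Python loop such as the sketch below can stand in for it; the working directory and interval are yours to adjust.

    # Minimal stand-in for a cron job: run the scraper roughly once a minute.
    # Adjust the working directory and the interval to suit your setup.
    import subprocess
    import time

    while True:
        subprocess.call(["python", "route_scraper.py", "ALL"],
                        cwd="nextmuni_import/src")
        time.sleep(60)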

NOTE: Before you choose how to provide data to the db, see step 4.2 below.

4. Do some pre-matchup setup

Now that there is tracking data in your database, you presumably want to match it to the GTFS schedule (that is, after all, why you are using this package). To do this, you first need to do some preprocessing.

4.1. From the core/src directory, run the following

  python Preprocessing.py trips 

This prepares the schedule data for faster matching with tracking data. It can take a while but it saves time in the long run.

4.2. Populate routeid_dirtag table

For this there are two options:

a) Populate the routeid_dirtag table manually. The idea is to match NextBus's "dirtag" values up with GTFS route IDs. The reason to do this manually is that NextBus and GTFS data are both riddled with errors that you may or may not have caught yet. (A sketch of what this population might look like follows the queries below.)

NOTE: If you aren't using a NextBus data source, then what to do here depends on what you provided for the dirtag field in step 3.

b) Try to automatically populate the routeid_dirtag table. This will not always work since, as noted in (a), there are lots of data errors over which I have no control and which I have no way to detect. You can do this as follows (from the core/src directory):

     python Preprocessing.py populate_ridtags 

Before you do this, though, you might want to check the results of the following SQL queries:


     select count(*) as cnt, route_short_name from gtf_routes 
       group by route_short_name having count(*) > 1;

     select count(distinct routetag) from vehicle_track 
       where dirtag != 'null' and routetag != 'null'
       group by dirtag having count(distinct routetag) > 1;

If both queries return zero rows, you are in good shape. If they return a lot of rows, you are not in such good shape and should give serious consideration to option (a). The data I currently have for SF Muni returns 121 rows for the second query.
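
If you go with option (a), the population itself is just one row per dirtag-to-route mapping. The sketch below assumes routeid_dirtag has route_id and dirtag columns (verify against core/sql/make_tables.sql), and the example mapping should be replaced by whatever your own inspection of the data turns up.

    # Hypothetical manual population of routeid_dirtag. The column names
    # (route_id, dirtag) are assumptions -- verify them against
    # core/sql/make_tables.sql -- and the mapping below is only an example.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="transit", user="transit")
    cur = conn.cursor()
    dirtag_to_route = {
        "N__IB": "N",    # NextBus dirtag -> GTFS route ID (example values)
        "N__OB": "N",
    }
    for dirtag, route_id in dirtag_to_route.items():
        cur.execute("insert into routeid_dirtag (route_id, dirtag) values (%s, %s)",
                    (route_id, dirtag))
    conn.commit()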

4.3. Implement get_direction_for_dirtag() method in core/src/dbqueries.py

This has already been implemented correctly for SF Muni data; however, it is unlikely to be correct for anything else. The dirtag field collected from the GPS data must indicate not only the route but also the direction in which that route is heading, and this direction must be matched up with the GTFS concept of direction, which is either '1' or '0'.

SFMuni NextBus data, for example, includes either "IB" or "OB" within the text of each dirtag. In the SFMuni GTFS data, inbound buses have a direction_id of '1' and outbound buses have a direction_id of '0'. Therefore the implementation currently in place simply looks for either "OB" or "IB" in the dirtag.
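
As a sketch of what the SF Muni behavior described above boils down to (the actual code in core/src/dbqueries.py may differ in detail):

    # Sketch of the SF Muni behavior described above; the real implementation
    # in core/src/dbqueries.py may differ in detail.
    def get_direction_for_dirtag(dirtag):
        """Map a NextBus dirtag to a GTFS direction_id ('0' or '1')."""
        if "IB" in dirtag:       # inbound
            return "1"
        if "OB" in dirtag:       # outbound
            return "0"
        return None              # unknown direction -- handle as appropriate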

5. Match GPS data to GTFS schedules (hoorah!)

NOTE: Before doing the following, you might want to selectively apply some of the indexes found in the core/sql/make_table_indexes.sql file.

From the core/src directory, run:

  python Preprocessing.py match_trips

You have now successfully matched the trips from 4.1 to GTFS scheduled trips. Now, run:

  python Preprocessing.py gps_schedules

This populates the gps_stop_times table with the actual schedule that each GPS trip followed, i.e., when each bus actually arrived at each stop (compare to gtf_stop_times).
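
A quick way to sanity-check the result is to peek at the new table. Only the table name below comes from this document; the connection parameters are placeholders.

    # Quick sanity check on the matching output (connection params are placeholders).
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="transit", user="transit")
    cur = conn.cursor()
    cur.execute("select count(*) from gps_stop_times")
    print("gps_stop_times rows:", cur.fetchone()[0])
    cur.execute("select * from gps_stop_times limit 5")
    for row in cur.fetchall():
        print(row)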

6. Make some corrections

Some situations can result in erroneous assignments after following step 5. To fix these, run

  python Preprocessing.py fix_earlybirds 

This searches for better matches for routes which are abnormally off-schedule.

You can visually test the GTFS<->GPS matchings using the OSMViz package. Once you have OSMViz installed, go to the core/src directory and look at the visualizer.py file. By editing the query at the end of the file (inside the "if __name__ == '__main__'" clause), you can choose what to visualize. Run "python visualizer.py" and enjoy the show.

7. Get ready to analyze

From the core/sql folder, apply the make_table_indexes.sql file to your database. This will make all future analyses much, much faster.

8. Analyze

There is some postprocessing/analysis code in the postprocessing directory. The postprocessing/sql directory holds code for creating a summary table of bus lateness data, as well as various queries that were useful to me when exploring the data. The postprocessing/src directory holds Stats.py, which performs a bunch of statistical analyses and makes plots of the data; it requires the SciPy and Matplotlib libraries.