Wednesday, 21 October 2015

Ask Data Anything - NYPD Motor vehicle accidents

In modern organizations, data management is both a major challenge and a major resource. In our experience, the first challenge facing a business that wants to use its data is obtaining a unified view of it. Data inside organizations is generally stored in different databases, often with proprietary APIs, which makes it difficult to move from one database to another. Furthermore, even when the same technology is used to store the data, semantic problems remain, such as different terminologies and languages.


The bigger the company, the harder it is to standardize procedures so that such situations do not arise. This happens because we are human and naturally tend to interpret data through our own experience and knowledge. We therefore cannot expect the technical team to name every part of a car using exactly the same terminology as the logistics department. This is why our solution aims to standardize the way end users interact with the data without actually changing the source of the data.

Ask Data Anything (ADA) allows companies to add a semantic layer on top of their data without copying it. The product handles term disambiguation, aggregation of data using hierarchies defined in ontologies, and data integration between different data sources.
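To make the idea of term disambiguation more concrete, here is a minimal Python sketch. It is not how ADA works internally; it only assumes that a semantic layer can be pictured as a lookup from department-specific terms to one shared concept. The `SYNONYMS` table and the `canonical` function are hypothetical names invented for this illustration.

```python
# Hypothetical synonym table: department-specific terms mapped to one
# shared concept. A real ontology would be far richer than a flat dict.
SYNONYMS = {
    "bonnet": "hood",
    "hood": "hood",
    "boot": "trunk",
    "trunk": "trunk",
}


def canonical(term, synonyms=SYNONYMS):
    """Return the shared concept for a term, or the cleaned term if unknown."""
    key = term.strip().lower()
    return synonyms.get(key, key)


# Two departments use different words but end up at the same concept.
print(canonical("Bonnet"), canonical("hood"))
```

The source data keeps its original wording; only the lookup layer decides that both terms mean the same thing.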


This tutorial shows ADA's capabilities using the New York City Police Department's data on motor vehicle collisions. It shows how easily a person without specialist training can access precise data, such as the number of people injured in accidents on every street, or a statistical comparison of the number of people killed across different types of vehicles. Results aggregated by location can be shown on a map to provide a comprehensive overview of road safety in New York City. All the information that police officers, journalists, and interested citizens need is at hand.

Many useful and free datasets containing statistical data about New York City can be accessed and downloaded from this website (https://data.cityofnewyork.us/). The data used in this tutorial differs slightly from the original: the amount of data was greatly reduced and the column names were changed. A small ontology was used to describe the crucial concepts.

Data structure

After importing data to ADA we can check what it contains by clicking "What's inside?".



Our data is a table in which every row represents information about a particular accident. ADA's Dimensions tab contains the names of the columns, which are treated as ontological concepts. The Operations tab lists the operations that can be used in a query. The Output tab shows the available methods for visualizing query results.
As one can see, the data consists of temporal information (date, time), location (borough, street, longitude, latitude, zip code), the type of vehicle involved in the accident, the factor that was the main cause, and the number of people killed and injured (with injured pedestrians, cyclists, and motorists counted separately).

To view all the data stored in a particular column, simply type its name and press ENTER.


Getting the data

One of the simplest questions one may ask is how many accidents happened in a particular area. This can be answered by executing the following query:

Example query:
count by borough in Brooklyn



The count operation counts the number of instances of a particular concept, in this case borough. Adding in Brooklyn restricts the results to only those rows that have Brooklyn as their borough. The returned value is therefore the number of accidents (rows) that happened in Brooklyn.
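For readers who think in code, the query above corresponds roughly to a filter followed by a row count. The Python sketch below only illustrates that idea and is not ADA's implementation. The `rows` list holds made-up sample values, `count_in` is a hypothetical helper, and the column name `borough` follows the tutorial's data.

```python
# Hypothetical sample rows; each dict stands for one accident (one table row).
rows = [
    {"borough": "Brooklyn", "street": "Atlantic Avenue"},
    {"borough": "Queens", "street": "Northern Boulevard"},
    {"borough": "Brooklyn", "street": "Flatbush Avenue"},
]


def count_in(rows, column, value):
    """Count the rows whose `column` equals `value`."""
    return sum(1 for row in rows if row[column] == value)


# Rough analogue of: count by borough in Brooklyn
brooklyn_accidents = count_in(rows, "borough", "Brooklyn")
print(brooklyn_accidents)
```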

The number of accidents in a borough can be judged as high or low only when it is compared with the corresponding numbers in other parts of the city. To make such a comparison, one can use the summarize operation.

Example query:
summarize borough by city



After executing the query, a table is shown with the number of accidents in each borough. Adding by city sets the aggregation to the city level. The raw data contains no information about the city; it is added on top of the data by the ontology file.
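The role of the ontology here can be pictured as an extra lookup that sits beside the raw rows. The following Python sketch assumes, purely for illustration, that the ontology can be reduced to a dictionary from borough to city; `BOROUGH_TO_CITY`, `summarize_by_city`, and the sample rows are hypothetical and do not reflect ADA's internal data model.

```python
from collections import Counter

# Assumed stand-in for the ontology: it states which city each borough
# belongs to. The raw rows below carry no city column at all.
BOROUGH_TO_CITY = {
    "Brooklyn": "New York City",
    "Queens": "New York City",
    "Manhattan": "New York City",
}

# Hypothetical raw rows, one per accident.
rows = [
    {"borough": "Brooklyn"},
    {"borough": "Queens"},
    {"borough": "Brooklyn"},
    {"borough": "Manhattan"},
]


def summarize_by_city(rows, borough_to_city):
    """Return {city: Counter({borough: accident_count})}."""
    summary = {}
    for row in rows:
        borough = row["borough"]
        city = borough_to_city[borough]  # knowledge added by the "ontology"
        summary.setdefault(city, Counter())[borough] += 1
    return summary


summary = summarize_by_city(rows, BOROUGH_TO_CITY)
for city, counts in summary.items():
    for borough, n in sorted(counts.items()):
        print(city, borough, n)
```

Keeping the borough-to-city knowledge outside the rows is the point of the sketch: the source table stays untouched while the grouping level comes from a separate layer.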

A table is not always the best way to represent data. ADA provides additional output methods, for instance pie charts.

Example query:
summarize vehicle by borough on piechart 





The output mode can be changed by adding on followed by the name of the output method. The above query returns a set of pie charts, each of which represents the factors that cause accidents for one particular vehicle type.

Numerical data

ADA automatically recognizes dates and numerical data, so the user does not have to worry about them. It also supports basic statistics.

Example query:
sum injured by borough on histogram


Sum injured sums all the values in the injured column. Adding by borough on histogram presents the totals, divided by borough, on a histogram.
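A grouped sum of this kind can be sketched in a few lines of Python. The example is an illustration under stated assumptions, not ADA's code: the sample rows are invented, `sum_by` is a hypothetical helper, and the text histogram is only a crude substitute for ADA's chart output.

```python
from collections import defaultdict

# Hypothetical sample rows; `injured` is the number of injured people
# recorded for one accident.
rows = [
    {"borough": "Brooklyn", "injured": 2},
    {"borough": "Queens", "injured": 0},
    {"borough": "Brooklyn", "injured": 1},
    {"borough": "Bronx", "injured": 3},
]


def sum_by(rows, value_column, group_column):
    """Sum `value_column` separately for every value of `group_column`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_column]] += row[value_column]
    return dict(totals)


# Rough analogue of: sum injured by borough
injured_by_borough = sum_by(rows, "injured", "borough")

# Crude text histogram: one '#' per injured person.
for borough, total in sorted(injured_by_borough.items()):
    print(f"{borough:<10} {'#' * total} ({total})")
```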

Another way to present data is on a map. ADA supports Google Maps and can use it to visualize data if the concept that the user groups results by is a location.

Example query:
sum killed by city on map




The size of a green circle represents the value relative to the other data points on the map. The exact value can be seen in a balloon.

The 'time' column contains the hour of each accident as a decimal number (9:30 = 9.5), which makes it possible to apply mathematical operations to its values.

Example query:
average time by city



The above query returns the average time value for New York City (because it is the only city in our example). It is approximately 16.7, which suggests that an average road accident in NYC takes place at about 4:40 PM, during the afternoon rush hour.
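The conversion between clock time and the decimal encoding is simple arithmetic: the minutes are divided by 60 and added to the hours. The Python sketch below shows both directions together with a plain average; the function names are hypothetical and chosen only for this illustration.

```python
def clock_to_decimal(hours, minutes):
    """Encode a clock time as decimal hours, e.g. 9:30 becomes 9.5."""
    return hours + minutes / 60


def decimal_to_clock(value):
    """Convert decimal hours to an HH:MM string, rounded to the nearest minute."""
    total_minutes = round(value * 60) % (24 * 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


print(clock_to_decimal(9, 30))   # decimal encoding of 9:30
print(decimal_to_clock(16.7))    # clock time for the average quoted above
```

Read this way, a value of 16.7 is 16 hours plus 0.7 × 60 = 42 minutes, that is 16:42, which matches the rough figure of 4:40 PM.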

ADA also lets the user retrieve subsets of the data that match certain mathematical expressions.

Example query:
street with killed > 1

The result shows all streets where accidents with more than one fatality occurred.
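A query of this shape keeps only the rows that satisfy the condition and then returns one column of them. A minimal Python sketch of that idea follows; the rows are made-up sample values and `streets_where` is a hypothetical helper, not part of ADA.

```python
# Hypothetical sample rows; `killed` is the number of fatalities in one accident.
rows = [
    {"street": "Atlantic Avenue", "killed": 0},
    {"street": "Queens Boulevard", "killed": 2},
    {"street": "Flatbush Avenue", "killed": 1},
    {"street": "Queens Boulevard", "killed": 3},
]


def streets_where(rows, predicate):
    """Return the distinct streets of rows that satisfy `predicate`, sorted."""
    return sorted({row["street"] for row in rows if predicate(row)})


# Rough analogue of: street with killed > 1
result = streets_where(rows, lambda row: row["killed"] > 1)
print(result)
```

Because the comparison is strict, a row with exactly one fatality does not match `killed > 1`; a condition such as `killed > 0` would be needed to include it.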

Summary 

Ask Data Anything provides quick and simple access to the data one needs. A user can check the names of streets where accidents with fatalities occurred, compare the number of accidents in different boroughs on a histogram, check on a pie chart what percentage of accidents involved sports cars, see the number of injured in every borough on a map, or simply ask for a whole column from the data set. Imposing more restrictions can provide very specific and precise results that would be hard and time-consuming to obtain using traditional applications. ADA can speed up the writing of reports and articles, and it can also be used for educational purposes. By developing the ontology on top of the data set, more advanced constraints and more detailed results can be obtained. For example, adding knowledge about street names and neighborhoods would make it possible to ask about accidents in a particular region. Such an ontology could also be used for other data sets that contain data by street name, like a list of nurseries or Chinese restaurants.

Ask Data Anything is an excellent tool for quickly analyzing large amounts of data in a smart, semantic way, with only a basic ontology created in Fluent Editor. Cognitum's best specialists work on improving it to provide higher efficiency and more advanced features.

References

A quick overview on Ask Data Anything and its features:



New York Police Department's official website:
http://www.nyc.gov/html/nypd/html/home/home.shtml

Cognitum's official website:
http://www.cognitum.eu/ 

