
Udacity Data Streaming Nanodegree project: SF Crime Analysis

Project's goal / Summary

To build an hourly crime-type analysis over real-time events (San Francisco police service calls), using Kafka to produce the events and Spark Streaming to aggregate, filter, process, and present the data.

Diagram

JSON -> KAFKA -> SPARK STREAMING

How to run

Once you have completed all the tasks, follow these steps to run the project from the Udacity workspace.

  • Start the Zookeeper service: /usr/bin/zookeeper-server-start config/zookeeper.properties

Screenshot: Started zookeeper service

  • Start the Kafka service: /usr/bin/kafka-server-start config/server.properties

Screenshot: Kafka service started

  • Start the Kafka producer: python kafka-server.py

Screenshot: Events being produced into the Kafka topic
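
For context, here is a minimal sketch of what the producer logic in kafka-server.py might look like, assuming the kafka-python client and a local JSON file of police call records; the file name, sleep interval, and serialization are assumptions for illustration, not the project's exact code:

```python
# Hypothetical producer sketch (assumes the kafka-python package is installed).
import json
import time

from kafka import KafkaProducer

TOPIC = "police.service.calls"                            # topic consumed in the next step
INPUT_FILE = "police-department-calls-for-service.json"   # assumed input file name

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open(INPUT_FILE) as f:
    calls = json.load(f)                                  # expects a JSON array of call records

for call in calls:
    producer.send(TOPIC, value=call)                      # publish one event per police call
    time.sleep(0.1)                                       # throttle to simulate a live stream

producer.flush()
```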

  • Confirm events are being sent: kafka-console-consumer --bootstrap-server localhost:9092 --topic police.service.calls --from-beginning

Screenshot: Messages being consumed from the Kafka topic

  • Run the Spark streaming script(*): spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.4 --master local[*] data_stream.py

Screenshot: Once a batch completes, its results are printed and a new batch starts.

(*) The data will be processed and the result table displayed in the terminal; you may want to press CTRL-C here, since the workspace terminal has limited capacity. Alternatively, you can leave it running and watch how the different batches show updated numbers.
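
As a rough idea of what data_stream.py does, here is a sketch of an hourly crime-type aggregation with PySpark Structured Streaming; the schema fields, window size, and output mode are assumptions for illustration, not the project's exact code:

```python
# Illustrative sketch of a Structured Streaming job over the Kafka topic (not the exact project code).
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("SFCrimeAnalysis").getOrCreate()

# Assumed subset of the dataset's fields.
schema = StructType([
    StructField("original_crime_type_name", StringType(), True),
    StructField("call_date_time", TimestampType(), True),
])

# Read raw events from the Kafka topic populated by the producer.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "police.service.calls")
       .option("startingOffsets", "earliest")
       .load())

# Kafka values arrive as bytes; parse the JSON payload into typed columns.
calls = (raw.selectExpr("CAST(value AS STRING) AS value")
         .select(F.from_json(F.col("value"), schema).alias("call"))
         .select("call.*"))

# Hourly count of calls per crime type.
hourly_counts = (calls
                 .groupBy(F.window("call_date_time", "1 hour"),
                          "original_crime_type_name")
                 .count())

# Print each batch's aggregated result table to the terminal.
query = (hourly_counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("truncate", "false")
         .start())

query.awaitTermination()
```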

During this process you will also see the progress reporter.

Screenshot: Progress reporter output

Screenshot: Batch 1 vs Batch 15

Spark UI: In the lower-right section of the workspace screen, click on "PREVIEW" to access the Spark UI.

Screenshot: Spark UI

Questions

  1. How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

These parameters are directly tied to the overall throughput and latency; for example (a configuration sketch follows this list):

  • The parameter maxRatePerPartition sets the maximum number of messages per partition, per batch, that can be loaded.
  • The parameter maxOffsetsPerTrigger limits the number of records fetched per trigger (not to be confused with the Kafka consumer setting max.poll.records).
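
A sketch of how such knobs can be set; the numeric values below are placeholders to tune, not recommendations, and maxOffsetsPerTrigger is passed as an option on the Kafka source rather than as a session config:

```python
# Illustrative tuning values only; the right numbers depend on the workload.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SFCrimeAnalysis")
         .config("spark.streaming.kafka.maxRatePerPartition", "100")
         .config("spark.sql.shuffle.partitions", "10")
         .getOrCreate())

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "police.service.calls")
       .option("maxOffsetsPerTrigger", "200")   # caps records fetched per micro-batch trigger
       .load())
```
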
  2. What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations in values, how can you tell these were the most optimal?

To verify that the values I was experimenting with were the most optimal, I monitored the processedRowsPerSecond metric from the streaming progress report, reaching 341.8551515158866 (see the sketch after the list below for how to read it).

The parameters I found most relevant:

  • spark.sql.shuffle.partitions
  • spark.streaming.kafka.maxRatePerPartition
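
As a rough sketch of how that metric can be read programmatically, assuming query is the StreamingQuery handle returned by writeStream.start():

```python
# Assumes `query` is a running StreamingQuery (e.g. returned by writeStream.start()).
import time

while query.isActive:
    progress = query.lastProgress              # most recent micro-batch progress report, or None
    if progress is not None:
        print("processedRowsPerSecond:", progress.get("processedRowsPerSecond"))
    time.sleep(10)                             # poll every 10 seconds
```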
