- Python 3 and pip are installed
- A virtual environment (venv) is created
source .venv/bin/activate
python3 -m pip install wheel
python3 -m pip install pandas
python3 process_csv_user_interaction.py user_visited.csv output.csv
- process_csv_user_interaction.py is the program that takes an input CSV and an output CSV file as arguments
- It generates the result as below (see the sketch after this list):
- Total Interactions
- Total Unique Users
- Most Visited URL
- Average Time Spent on Each URL
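A minimal sketch of what process_csv_user_interaction.py might look like with pandas is shown below. The input column names (user_id, url, time_spent) are assumptions for illustration; the actual script may use different names and output layout.

```python
# Hypothetical sketch of process_csv_user_interaction.py.
# Assumes the input CSV has columns: user_id, url, time_spent (seconds).
import sys

import pandas as pd


def main(input_csv: str, output_csv: str) -> None:
    df = pd.read_csv(input_csv)

    # Summary metrics listed above
    total_interactions = len(df)
    total_unique_users = df["user_id"].nunique()
    most_visited_url = df["url"].value_counts().idxmax()
    avg_time_per_url = df.groupby("url")["time_spent"].mean()

    summary = pd.DataFrame({
        "metric": ["Total Interactions", "Total Unique Users", "Most Visited URL"],
        "value": [total_interactions, total_unique_users, most_visited_url],
    })

    # Append the per-URL average time spent as additional rows
    per_url = avg_time_per_url.reset_index()
    per_url.columns = ["metric", "value"]
    per_url["metric"] = "Average Time Spent on " + per_url["metric"]

    pd.concat([summary, per_url], ignore_index=True).to_csv(output_csv, index=False)


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```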
source .venv/bin/activate
python3 -m pip install wheel
python3 -m pip install pandas
python3 process_csv_online_purchases.py customer_input.txt output1.csv
- process_csv_online_purchases.py is the program that takes an input file and an output CSV file as arguments
- It generates the result as below (see the sketch after this list):
- date (str): Date of the transaction in the format 'YYYY-MM-DD'.
- customer_id (int): Unique identifier for each customer.
- total_spent (float): Total amount spent by the customer on that date.
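Similarly, a minimal sketch of process_csv_online_purchases.py is given below. It assumes customer_input.txt is comma-separated with date, customer_id, and amount columns and that total_spent is the per-customer, per-date sum; both the column names and the aggregation are assumptions.

```python
# Hypothetical sketch of process_csv_online_purchases.py.
# Assumes customer_input.txt is comma-separated with columns: date, customer_id, amount.
import sys

import pandas as pd


def main(input_path: str, output_csv: str) -> None:
    df = pd.read_csv(input_path)

    # Total amount spent by each customer on each date
    result = (
        df.groupby(["date", "customer_id"], as_index=False)["amount"]
          .sum()
          .rename(columns={"amount": "total_spent"})
    )
    result["total_spent"] = result["total_spent"].astype(float)

    result.to_csv(output_csv, index=False)


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```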
I am assuming we are interested in designing a batch processing pipeline.
The process involved in designing the pipeline would be as follows:
Extract > Transform > Load
The following components (AWS managed services) might possibly be used:
Kinesis Streaming Agent -> Kinesis Data Firehose -> Transform Records -> Save Intermediate Records in S3 -> AWS Lambda -> AWS Redshift
                                     |
                                     |-> AWS Lambda (Save source data in S3)
- Kinesis Agent / Kinesis Producer Library on the website pulls data from the database and pushes it to Kinesis Data Streams / Data Firehose [Extract]
- Kinesis Data Firehose - [Transform]
- AWS Lambda (Save source data in S3)
- Kinesis Data Firehose - Transform Records (a sketch of the transformation Lambda follows this list)
- Kinesis Data Firehose - Save Intermediate records in S3
- AWS Lambda - Load Transformed Records in AWS Redshift
- Database / Blob Storage - AWS Redshift - [Load]
- Analytics Tools - provide an interface to run queries against the analytics data, generate reports, and derive insights
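Below is a minimal, hypothetical sketch of the "Transform Records" Lambda that Kinesis Data Firehose would invoke. The payload fields (date, customer_id, amount) are assumptions carried over from the purchases example above; the actual schema would come from the producer. Records that fail transformation are returned as ProcessingFailed so Firehose can back them up to S3, which ties into the reliability point below.

```python
# Hypothetical Firehose data-transformation Lambda (sketch, not the actual implementation).
# Assumes each incoming record is a JSON object such as
# {"date": "2024-01-31", "customer_id": 42, "amount": 19.99}.
import base64
import json


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Example transformation: keep only the fields the warehouse needs
            transformed = {
                "date": payload["date"],
                "customer_id": int(payload["customer_id"]),
                "total_spent": float(payload["amount"]),
            }
            data = base64.b64encode(
                (json.dumps(transformed) + "\n").encode("utf-8")
            ).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (KeyError, ValueError):
            # Failed records are marked ProcessingFailed; Firehose can deliver them to an S3 backup
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```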
Availability of the managed services is guaranteed by AWS, and capacity can be provisioned as needed (across different data centers / Availability Zones).
Reliability is built into the solution by saving intermediate records and failed transformation records in S3.
To make the processing more scalable, we could limit the size of the batches at the source and process them in parallel. We could also add Firehose delivery streams and S3 buckets for parallel processing of the data.