A toolset for mining and visualizing Git repositories with Social Network Analysis. ScrapLogGit2Net allows its users to scrape, model, and visualize social networks based on common source-code file edits for any given Git repository.
The toolset was first developed by Jose Apolinário Teixeira during his doctoral studies, with some guidance from Software Engineering scholars with expertise in mining software repositories. Its merit lies in considering both individuals and organizations: it maps developers to organizations using the commit email address and external APIs, such as the REST and GraphQL APIs provided by GitHub.
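As a rough illustration of the email-based mapping, the sketch below derives an organization from the domain of a commit email address. The function name, the lookup table, and the list of public email providers are hypothetical and only illustrate the idea; the actual tool may resolve affiliations differently (for example through the GitHub APIs).

```python
from typing import Optional

# Hypothetical lookup table: e-mail domain -> organization name.
DOMAIN_TO_ORG = {
    "redhat.com": "Red Hat",
    "google.com": "Google",
}

# Domains that say nothing about an employer (illustrative, not exhaustive).
PUBLIC_DOMAINS = {"gmail.com", "outlook.com"}


def affiliation_from_email(email: str) -> Optional[str]:
    """Return an organization guess for a commit e-mail, or None if unknown."""
    _, separator, domain = email.strip().lower().rpartition("@")
    if not separator or not domain or domain in PUBLIC_DOMAINS:
        return None
    # Fall back to the bare domain when it is not in the lookup table.
    return DOMAIN_TO_ORG.get(domain, domain)
```

Returning None for public providers keeps personal addresses from being counted as an organization.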
Newer features allow you to:
- Transform a network of individuals/developers into a network of organizations/firms. The weight of the edge between two organizations is the sum of developers that worked together (i.e., co-edited the same source-code files).
- Filter developers by email (handy for dealing with bots that commit code).
- Use parallel edges (i.e., multiple edges between two nodes) to attribute a weight to a cooperative relationship between two developers (e.g., the number of times they co-edited a source-code file).
- Visualize collaborations dynamically using NetworkX and Matplotlib.
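The transformation from a network of individuals into a network of organizations can be sketched with NetworkX as follows. This is a minimal sketch, not the project's implementation: the node attribute name `affiliation`, the function name, and the choice to add 1 to the weight for every inter-developer edge that crosses two organizations are all assumptions made for illustration.

```python
import networkx as nx


def developers_to_organizations(dev_graph: nx.Graph, attr: str = "affiliation") -> nx.Graph:
    """Collapse a developer network into a weighted organization network."""
    org_graph = nx.Graph()
    for dev_a, dev_b in dev_graph.edges():
        org_a = dev_graph.nodes[dev_a].get(attr)
        org_b = dev_graph.nodes[dev_b].get(attr)
        # Skip unknown affiliations and collaborations inside one organization.
        if org_a is None or org_b is None or org_a == org_b:
            continue
        if org_graph.has_edge(org_a, org_b):
            org_graph[org_a][org_b]["weight"] += 1
        else:
            org_graph.add_edge(org_a, org_b, weight=1)
    return org_graph


# Illustrative data: one developer at organization A, two at organization B.
dev_graph = nx.Graph()
dev_graph.add_node("alice@a.example", affiliation="A")
dev_graph.add_node("bob@b.example", affiliation="B")
dev_graph.add_node("carol@b.example", affiliation="B")
dev_graph.add_edges_from([
    ("alice@a.example", "bob@b.example"),
    ("alice@a.example", "carol@b.example"),
    ("bob@b.example", "carol@b.example"),  # intra-organization, ignored
])
org_graph = developers_to_organizations(dev_graph)
```

In this toy network, the two cross-organization collaborations end up as a single A-B edge whose weight is 2.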
The code was also recently (Spring 2024) made compliant with NetworkX data structures and the Python 3.10 runtime, which simplified the original codebase.
For more information, see the publication and related website:
- Teixeira, J., Robles, G., & González-Barahona, J. M. (2015). Lessons learned from applying social network analysis on an industrial Free/Libre/Open Source Software ecosystem. Journal of Internet Services and Applications, 6, 1-27. Available open-access at https://jisajournal.springeropen.com/articles/10.1186/s13174-015-0028-2.
- Website http://users.abo.fi/jteixeir/OpenStackSNA/ with the obtained social networks and visualizations included in publications by the author on the OpenStack software ecosystem.
- Website http://users.abo.fi/jteixeir/TensorFlowSNA/ with the obtained social networks and visualizations for the TensorFlow open and cooperative project (publication forthcoming).
It is hard to figure out (visualize) who works with whom in complex software projects.
A world where software co-production analytics put social network visualizations at the side of standard quantitative statistical data. All towards the improved management and engineering of complex software projects orchestrated on Git.
Welcome to the project! We're excited to have you contribute to ScrapLogGit2Net. This guide will help you get started and ensure that your contributions are aligned with our project's standards.
- Setting Up Your Development Environment
- Coding Style
- Architecture
- Logging
- Progress Bars
- Git Workflow
- Easy Hacks
- Contact
- Fork the repository on GitHub. Go to https://github.com/jaateixeira/ScrapLogGit2Net/ and click the "Fork" button in the top right corner of the page. GitHub will create a copy of the repository under your GitHub account. Changes to your fork do not affect the original project, so you can clone the fork to your local machine, make your changes, and push them back to your fork on GitHub.
- Clone your forked repository to your local machine:
git clone https://github.com/your-username/ScrapLogGit2Net.git
- Install the dependencies (see dependencies.sh).
We adopt the PEP 8 style guide towards writing clean, readable Python code.
ScrapLogGit2Net started as a quick script for scientific research. Quickly obtaining and processing data for research papers was the main goal. This is not a large, clean, object-oriented, test-driven masterpiece. Still, good principles for Python programming apply: (1) follow naming conventions, (2) type-check your function parameters, and (3) be careful with the use of global variables. Variables should have descriptive names in snake_case for readability and consistency. Type hints should be used in function definitions to specify expected input and output types, which makes the code clearer and easier to debug. Access to global variables should be minimized, as it can lead to code that is difficult to understand and maintain. Instead, use function parameters and return values to manage data flow whenever possible, which promotes modularity and reduces side effects.
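A short sketch of these conventions is shown below. The function names and the sample data are hypothetical and do not come from the codebase; they only illustrate snake_case names, type hints, and passing data through parameters and return values instead of globals.

```python
from collections import Counter


def count_commits_per_author(commit_authors: list[str]) -> dict[str, int]:
    """Return how many commits each author e-mail made."""
    return dict(Counter(commit_authors))


def most_active_author(commits_per_author: dict[str, int]) -> str:
    """Return the author with the most commits (first seen wins on ties)."""
    return max(commits_per_author, key=lambda author: commits_per_author[author])


# Data flows in through parameters and out through return values.
commits_per_author = count_commits_per_author(["a@x.org", "b@y.org", "a@x.org"])
top_author = most_active_author(commits_per_author)
```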
PEP 8 is the style guide for Python code. It promotes readability and consistency in Python codebases. Following these guidelines will help improve the readability and maintainability of your code.
For the full PEP 8 documentation, please visit the official page: PEP 8 -- Style Guide for Python Code
flake8 is a tool that checks your Python code against the PEP 8 style guide. It helps identify and fix stylistic issues in your code.
To install flake8, run the following command:
pip install flake8
To check a module, run:
flake8 your_module.py
Please use the built-in globals() function to access the global scope’s name table. This signals to other developers that we are dealing with an important global variable that should not be changed carelessly.
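The convention might look like the sketch below. The variable is named after the `stats` dictionary listed for scrapLog.py, but its keys and the `register_commit` function are hypothetical and only serve the illustration.

```python
# Module-level statistics dictionary; the keys used here are hypothetical.
stats = {"n_commits": 0, "n_skipped": 0}


def register_commit(skipped: bool = False) -> None:
    """Update the module-level statistics through globals()."""
    # globals() returns the module's name table as a dict, which makes the
    # access to a global variable explicit at the point of use.
    scraping_stats = globals()["stats"]
    scraping_stats["n_commits"] += 1
    if skipped:
        scraping_stats["n_skipped"] += 1


register_commit()
register_commit(skipped=True)
```

Because the lookup goes through `globals()["stats"]`, a reader can search for `globals()` to find every place where shared state is touched.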
TODO Table
File | Variable | Type | Description |
---|---|---|---|
scrapLog.py | G_network_Dev2Dev_singleEdges | nx.Graph() | Inter-individual network - edges are unweighted |
scrapLog.py | G_network_Dev2Dev_multiEdges | nx.MultiGraph() | Inter-individual network - edges can be weighted |
scrapLog.py | stats | Dictionary (immutable keys) | Keeps statistics of the scraping |
formatFilterAndViz-nofi-GraphML.py | TODO | TODO | TODO |
transform-nofi-2-nofo-GraphML.py | TODO | TODO | TODO |
formatFilterAndViz-nofo-GraphML.py | TODO | TODO | TODO |
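The difference between the two graph types in the table can be illustrated with a few lines of NetworkX. The developer names and co-edit data are made up, and using the number of parallel edges as the relationship weight is one possible reading of the table, not necessarily how scrapLog.py computes it.

```python
import networkx as nx

# Each pair stands for one file that both developers edited (illustrative data).
co_edits = [("alice", "bob"), ("alice", "bob"), ("bob", "carol")]

single_edges = nx.Graph()
single_edges.add_edges_from(co_edits)  # duplicate pairs collapse into one edge

multi_edges = nx.MultiGraph()
multi_edges.add_edges_from(co_edits)  # every co-edit is kept as a parallel edge

# The number of parallel edges can serve as the weight of a relationship.
weight_alice_bob = multi_edges.number_of_edges("alice", "bob")
```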
The ScrapLogGit2Net project leverages several powerful Python libraries to achieve its functionality. This section provides an overview of the key libraries used and how they fit into the project's architecture.
NumPy is used for numerical operations, including the creation and manipulation of arrays and matrices. It provides support for large multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
- Usage: NumPy is typically used for data manipulation, mathematical calculations, and handling large datasets.
- Documentation: NumPy Documentation
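One way NumPy can complement NetworkX is by working on the adjacency matrix of a network. The sketch below is only an example of that kind of numerical operation; whether the project uses NumPy in exactly this way is an assumption, and the graph data is invented.

```python
import networkx as nx
import numpy as np

graph = nx.Graph([("alice", "bob"), ("bob", "carol")])
nodes = sorted(graph.nodes())

# Dense adjacency matrix with rows and columns ordered as in `nodes`.
adjacency: np.ndarray = nx.to_numpy_array(graph, nodelist=nodes)

# Row sums of an unweighted adjacency matrix give each node's degree.
degrees = adjacency.sum(axis=1)
```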
NetworkX is utilized for creating, manipulating, and studying the structure, dynamics, and functions of complex networks. It allows for the creation of both undirected and directed graphs, along with various algorithms to analyze them.
- Usage: NetworkX is used for constructing and analyzing network graphs, which is a core part of the project's functionality.
- Documentation: NetworkX Documentation
Matplotlib is a plotting library used for creating static, interactive, and animated visualizations in Python. It is heavily used for generating plots, charts, and other graphical representations of data.
- Usage: Matplotlib is used to visualize data, such as network graphs and other statistical plots.
- Documentation: Matplotlib Documentation
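A minimal sketch of drawing a collaboration network with NetworkX and Matplotlib follows. The graph, the output file name, and the styling are illustrative assumptions; the non-interactive Agg backend is chosen here only so the example runs without a display.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import networkx as nx

graph = nx.Graph([("alice", "bob"), ("bob", "carol")])
positions = nx.spring_layout(graph, seed=42)  # fixed seed for a reproducible layout

fig, ax = plt.subplots(figsize=(4, 3))
nx.draw_networkx(graph, pos=positions, ax=ax, node_color="lightblue")
ax.set_axis_off()

# Hypothetical output location in the system's temporary directory.
output_path = os.path.join(tempfile.gettempdir(), "collaboration_network.png")
fig.savefig(output_path, dpi=100)
plt.close(fig)
```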
Rich is a library for rich text and beautiful formatting in the terminal. It is used to create aesthetically pleasing and user-friendly command-line interfaces with features like progress bars, tables, and syntax highlighting.
- Usage: Rich is used to enhance the terminal output, making it more informative and visually appealing, especially for progress indicators and formatted output.
- Documentation: Rich Documentation
Rich is used for:
- Printing colored text in the console (e.g., debug information)
- Printing text in Markdown format for better readability
- Printing emojis that reflect the state or mood in the console
- Inspect function to help you learn about objects
- Colored Logging in integration with Loguru
- Good-looking Tables
- Progress Bars and Wait Spinners
- Better Looking Errors, with colored stack traces
See video tutorial for more information.
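Two of the Rich features listed above, progress bars and tables, can be tried with the short sketch below. The work items and the table contents are placeholders; how the project wires Rich into its scripts may differ.

```python
from rich.console import Console
from rich.progress import track
from rich.table import Table

console = Console()

commit_blocks = ["block-1", "block-2", "block-3"]  # placeholder work items
processed = []

# track() wraps any iterable and renders a progress bar while it is consumed.
for block in track(commit_blocks, description="Scraping commits..."):
    processed.append(block.upper())

table = Table(title="Scraping summary")
table.add_column("Metric")
table.add_column("Value", justify="right")
table.add_row("Blocks processed", str(len(processed)))
console.print(table)
```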
Loguru is a library designed for simple and effective logging. It simplifies the process of logging by providing an easy-to-use and powerful logging mechanism.
- Usage: Loguru is used to handle logging throughout the project, ensuring that logs are informative, easy to read, and useful for debugging.
- Documentation: Loguru Documentation
The integration of these libraries follows a well-structured workflow:
- Data Handling: NumPy is used to preprocess and handle data efficiently.
- Network Construction: NetworkX is used to construct and manipulate network graphs from the data.
- Visualization: Matplotlib is used to create visual representations of the network graphs and other data.
- User Interface: Rich is used to create an enhanced command-line interface for better user interaction.
- Logging: Loguru is used throughout the project to log important information, errors, and debugging details.
By leveraging these libraries, ScrapLogGit2Net achieves a robust, efficient, and user-friendly architecture that simplifies complex data operations, network analysis, visualization, and interaction.
For more detailed guidelines on how to contribute to the project, please refer to the rest of the CONTRIBUTING.md file.
Thank you for your contributions and for helping improve ScrapLogGit2Net!
- Fork the Repository
- Clone the Forked Repository
- Create a Feature Branch
- Make Your Changes
- Commit Your Changes
- Push Your Feature Branch
- Create a Pull Request
- Respond to Feedback
- Go to the ScrapLogGit2Net repository on GitHub.
- Click the Fork button in the upper right corner of the page. This will create a copy of the repository under your GitHub account.
- Open your terminal or command prompt.
- Clone your forked repository to your local machine:
git clone https://github.com/your-username/ScrapLogGit2Net.git
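Sync your fork (if you have not done so yet, first add the original repository as the upstream remote):
git remote add upstream https://github.com/jaateixeira/ScrapLogGit2Net.git
git checkout main
git pull upstream main
git push origin main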
Create a new branch for your feature/bugfix:
git checkout -b feature-branch
Make your changes and commit them:
git add .
git commit -m "Description of your changes"
Push your branch to GitHub:
git push origin feature-branch
- Go to your fork on GitHub.
- Click on the "New Pull Request" button.
- Select the base fork and branch (our repository's main branch) and compare it with your feature-branch.
- Create the pull request with a clear and detailed description of your changes.
Once you submit your pull request, it will be reviewed by the project maintainers. Here’s what to expect:
- Initial Review: We will review your code for adherence to the coding standards and overall implementation.
- Feedback: You might receive feedback or requests for changes.
- Approval: Once your pull request passes review, it will be merged into the main branch.
Please be responsive to feedback and make the necessary changes promptly to expedite the review process.
Jose Teixeira jose.teixeira@abo.fi
TODO