
Add dependabot PR docs #613

Closed
wants to merge 16 commits
Binary file added .gitbook/assets/image (1).png
Binary file added .gitbook/assets/image (2).png
Binary file added .gitbook/assets/image (3).png
Binary file added .gitbook/assets/image (4).png
Binary file added .gitbook/assets/image.png
19 changes: 0 additions & 19 deletions GettingStarted.md

This file was deleted.

45 changes: 4 additions & 41 deletions README.md
@@ -1,44 +1,7 @@
# [The Philadelphia Animal Welfare Society (PAWS)](phillypaws.org)
# Developer Guide for PAWS data pipeline

As the city's largest animal rescue partner and no-kill animal shelter,
the [Philadelphia Animal Welfare Society (PAWS)](phillypaws.org) is working to make Philadelphia
a place where every healthy and treatable pet is guaranteed a home. Since inception over 10 years ago,
PAWS has rescued and placed 27,000+ animals in adoptive and foster homes, and has worked to prevent pet homelessness
by providing 86,000+ low-cost spay/neuter services and affordable vet care to 227,000+
clinic patients. PAWS is funded 100% through donations, with 91 cents of every dollar collected going
directly to the animals. Therefore, PAWS' rescue work (including 3 shelters and all rescue and
animal care programs), administration and development efforts are coordinated by only about
70 staff members complemented by over 1500 volunteers.
This is the PAWS data pipeline documentation for the developers and administrators working on the [PAWS PDP project](https://github.com/CodeForPhilly/paws-data-pipeline) or running instances of it. PAWS staff and other end users, please use the [end user documentation](https://paws-data-pipeline.gitbook.io/user/).

## [The Data Pipeline](https://codeforphilly.org/projects/paws_data_pipeline)
It is maintained in the GitBook format within the docs/ directory of the project's main Git repository, and new versions are published automatically upon commits or merges to the `documentation-dev` branch in [the project's GitHub repo](https://github.com/CodeForPhilly/paws-data-pipeline).

Through all of its operational and service activities, PAWS accumulates data regarding donations,
adoptions, fosters, volunteers, merchandise sales, event attendees (to name a few),
each in their own system and/or manual tally. This vital data that can
drive insights remains siloed and is usually difficult to extract, manipulate, and analyze.

This project provides PAWS with an easy-to-use and easy-to-support tool to extract
constituent data from multiple source systems, standardize extracted data, match constituents across data sources,
load relevant data into Salesforce, and run an automation in Salesforce to produce an RFM score.
Through these processes, the PAWS data pipeline has laid the groundwork for facilitating an up-to-date 360-degree view of PAWS constituents, and
flexible ongoing data analysis and insights discovery.

## Uses

- The pipeline can inform the PAWS development team of new constituents through volunteer or foster engagement
- Instead of manually matching constituents from volunteering, donations and foster/adoptions, PAWS staff only need to upload the volunteer dataset into the pipeline, and the pipeline handles the matching
- Volunteer and Foster data are automatically loaded into the constituent's SalesForce profile
- An RFM score is calculated for each constituent using the most recent data
- Data analyses can use the output of the PDP matching logic to join datasets from different sources; PAWS can benefit from such analyses in the following ways:
- PAWS operations can be better informed and use data-driven decisions to guide programs and maximize effectiveness;
- Supporters can be further engaged by suggesting additional opportunities for involvement based upon pattern analysis;
- Multi-dimensional supporters can be consistently (and accurately) acknowledged for all the ways they support PAWS (i.e. a volunteer who donates and also fosters kittens), not to mention opportunities to further tap the potential of these enthusiastic supporters.

## [Code of Conduct](https://codeforphilly.org/pages/code_of_conduct)

This is a Code for Philly project operating under their code of conduct.

## Links

[Slack Channel](https://codeforphilly.org/chat?channel=paws_data_pipeline)
[Wiki](https://github.com/CodeForPhilly/paws-data-pipeline/wiki)
To contribute to this book, please commit directly or open a pull request against the `documentation-dev` branch of github.com/CodeForPhilly/paws-data-pipeline.
27 changes: 27 additions & 0 deletions SUMMARY.md
@@ -0,0 +1,27 @@
# Table of contents

* [Overview](README.md)
* [Setup](setup/README.md)
* [Getting Started](setup/getting-started.md)
* [Local Setup](setup/local-setup.md)
* [Accessing APIs without React](setup/accessing-apis-without-react.md)
* [Architecture](architecture/README.md)
* [User management and authorization](architecture/user-management-and-authorization.md)
* [Async on the cheap (for MVP)](architecture/async-on-the-cheap-for-mvp.md)
* [Execution status stages](architecture/execution-status-stages.md)
* [Data Flow](architecture/data-flow.md)
* [Database Schema](architecture/database-schema.md)
* [Operations](deployment/README.md)
* [Using GitHub actions](deployment/using-github-actions.md)
* [Deploying PDP within the Code for Philly cluster](deployment/deploying-pdp-within-the-code-for-philly-cluster.md)
* [Kubernetes Setup](setup/kubernetes-setup.md)
* [Kubernetes logs](deployment/kubernetes-logs.md)
* [Merging Dependabot PRs](deployment/merging-dependabot-prs.md)
* [Troubleshooting](troubleshooting/README.md)
* [Common Errors](troubleshooting/common-errors.md)
* [Dups Problem](troubleshooting/dups-problem.md)
* [Code of Conduct](https://codeforphilly.org/pages/code_of_conduct)
* [Contributors](https://test.pawsdp.org/about)
* Glossary
* Archives
* [RFM](architecture/rfm.md)
3 changes: 3 additions & 0 deletions architecture/README.md
@@ -0,0 +1,3 @@
# Architecture

This section contains information on the architecture used for the PAWS data pipeline project.
34 changes: 34 additions & 0 deletions architecture/async-on-the-cheap-for-mvp.md
@@ -0,0 +1,34 @@
# Async on the cheap (for MVP)

### Introduction

It's recognized [1, 2] that the best way to handle long-running tasks is to use a task queue, allowing separation of the middle layer (API server) and the execution server. But as we're trying to get an MVP out for feedback, it's not unreasonable to use a less-than-perfect solution in the interim. Here are a few ideas for discussion:

### _Continue to treat execute() as synchronous but stream back status information_

We've been operating (at the API server) with a model of _receive request, do work, return() with data_. But both Flask and JS support streaming data in chunks from server to client:\
Flask: [Streaming Contents](https://flask.palletsprojects.com/en/1.1.x/patterns/streaming/)\
JS: [Using readable streams](https://developer.mozilla.org/en-US/docs/Web/API/Streams_API/Using_readable_streams)\
\
From the Flask side, the data it streams back would be status updates (_e.g._, every 100 rows processed) which the React client would use to update the display. When the server sends back "complete", React displays a nice completion message and the user proceeds to the 360 view.
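A minimal sketch of the server side of this idea, assuming a hypothetical `process_rows()` generator standing in for the real matching work (the route name and chunk size are illustrative, not the actual pipeline code):

```
# Sketch only: process_rows() and the route are illustrative assumptions.
from flask import Flask, Response

app = Flask(__name__)

def process_rows():
    # Stand-in for the real work; yield a status line every 100 rows
    # so the React client can update its display as chunks arrive.
    total = 1000
    for done in range(100, total + 1, 100):
        # ... process a chunk of rows here ...
        yield f"processed {done}/{total}\n"
    yield "complete\n"

@app.route("/api/execute")
def execute():
    # Flask streams the generator's output back to the client in chunks.
    return Response(process_rows(), mimetype="text/plain")
```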

#### **Evaluation**

Doesn't appear to require much heavy lifting at server or client (we would need to figure out how to feed the generator on the server), but it may be a bit brittle: if there's any kind of network hiccup (or the user reloads the page), the stream would be broken and we wouldn't be able to tell the user anything useful.

### _Client aborts Fetch, polls status API until completion_

In this idea, instead of waiting for the execute() Fetch to complete, the React client uses an [AbortController](https://developer.mozilla.org/en-US/docs/Web/API/AbortController/abort) to cancel the pending Fetch. It then starts polling the API execution status endpoint, displaying updates until that endpoint reports that the operation is complete.

**Evaluation**

Using SQLAlchemy's `engine.dispose()` and two uWSGI processes, I've got `/api/get_execution_status/<job_id>` working correctly. I'd probably want to have it find the latest job instead of having to specify one (although we could use the streaming model above to send back the `job_id`).

![](https://user-images.githubusercontent.com/11001850/112061042-4ceb9580-8b34-11eb-8dc7-fb9eede44d7d.png)

We need to figure out what side effects there might be to cancelling the fetch. I presume the browser would drop the connection; will Flask assume it can kill the request?\
The client could check status when the page loads to see if there's a running job, so it would be more robust in the face of network issues or reloads.
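For illustration, a minimal sketch of what the polled status endpoint might look like (the table and column names here are assumptions, not the actual schema):

```
# Sketch only: table/column names and the DSN are assumptions.
from flask import Flask, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)
engine = create_engine("postgresql://user:pass@localhost/paws")  # placeholder DSN

@app.route("/api/get_execution_status/<int:job_id>")
def get_execution_status(job_id):
    with engine.connect() as conn:
        row = conn.execute(
            text("SELECT stage, details FROM execution_status WHERE job_id = :id"),
            {"id": job_id},
        ).first()
    if row is None:
        return jsonify(error="no such job"), 404
    return jsonify(job_id=job_id, stage=row.stage, details=row.details)
```

Finding the latest job instead would just replace the `WHERE` clause with `ORDER BY job_id DESC LIMIT 1`.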

[1] [https://flask.palletsprojects.com/en/1.1.x/patterns/celery/](https://flask.palletsprojects.com/en/1.1.x/patterns/celery/)\
[2] [https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-xxii-background-jobs](https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-xxii-background-jobs)

7 changes: 7 additions & 0 deletions architecture/data-flow.md
@@ -0,0 +1,7 @@
# Data Flow

![](<../.gitbook/assets/image (2).png>)

[flow chart](https://app.lucidchart.com/invitations/accept/0602fccf-18f9-48d4-84ff-ffe5f0b03e7a)

**ShelterLuv People**: This data is pulled by a script that calls the ShelterLuv API and saves the data as a CSV into a Dropbox folder via a Dropbox "app". It is set up to run from a config file plus a cron job, although this is not yet active in deployment. Every run pulls everything, because the API doesn't support pagination. To configure the automation, the config file needs to contain the app ID.
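For reference, a pull script along these lines might look like the following sketch; the endpoint path, header name, and config keys are illustrative assumptions rather than the actual script's values:

```
# Hypothetical sketch of the ShelterLuv pull described above; the endpoint,
# header, and config keys are assumptions, not the deployed script.
import csv
import json

import requests

def pull_shelterluv_people(api_key, out_path):
    # No pagination support, so every run fetches the entire dataset.
    resp = requests.get(
        "https://www.shelterluv.com/api/v1/people",
        headers={"X-Api-Key": api_key},
        timeout=60,
    )
    resp.raise_for_status()
    people = resp.json().get("people", [])
    if not people:
        return
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(people[0].keys()))
        writer.writeheader()
        writer.writerows(people)

if __name__ == "__main__":
    with open("config.json") as f:  # assumed to hold the app ID / API key
        cfg = json.load(f)
    pull_shelterluv_people(cfg["shelterluv_api_key"], "shelterluv_people.csv")
```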
8 changes: 8 additions & 0 deletions architecture/database-schema.md
@@ -0,0 +1,8 @@
# Database Schema

TODO: fix link

[https://app.diagrams.net/#G1X4KbjYf7vcrfbeJLfyCj8xUPp8zGcV2k](https://app.diagrams.net/#G1X4KbjYf7vcrfbeJLfyCj8xUPp8zGcV2k)


5 changes: 5 additions & 0 deletions architecture/execution-status-stages.md
@@ -0,0 +1,5 @@
# Execution status stages

The `execution_status` table will be updated for a given `job_id` through the stages in the diagram below.

![](../.gitbook/assets/image.png)
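As a rough illustration of how a job might advance through those stages (the stage names and columns here are assumptions read off the diagram, not the exact schema):

```
# Illustrative sketch only: stage names and columns are assumptions.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/paws")  # placeholder DSN

STAGES = ("initiated", "processing", "matching", "complete")  # assumed names

def set_execution_status(job_id, stage, details=""):
    # Each phase of a job advances the row for its job_id to the next stage.
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    with engine.begin() as conn:
        conn.execute(
            text(
                "UPDATE execution_status "
                "SET stage = :stage, details = :details "
                "WHERE job_id = :job_id"
            ),
            {"stage": stage, "details": details, "job_id": job_id},
        )
```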
68 changes: 68 additions & 0 deletions architecture/rfm.md
@@ -0,0 +1,68 @@
# RFM

## RFM Data Flows

![](<../.gitbook/assets/image (3).png>)

## RFM Database Tables

![](<../.gitbook/assets/image (4).png>)

## RFME Bin Logic

### Recency:

If a person's last donation was:

* in the last 180 days: R = 5
* 180 - 365 days ago: R = 4
* 365 - 728 days ago: R = 3
* 728 - 1093 days ago: R = 2
* more than 1093 days ago: R = 1
* never given: R = 0

### Frequency:

If in the last 24 months someone has made a total of

* 24 or more donations: F = 5
* 12 - 23 donations: F = 4
* 3 - 11 donations: F = 3
* 2 donations: F = 2
* 1 donation: F = 1
* 0 donations: F = 0

### Monetary value:

If someone's cumulative giving in the past 24 months is

* $2001 or more: M = 5
* $501 - $2000: M = 4
* $250 - $500: M = 3
* $101 - $249: M = 2
* $26 - $100: M = 1
* $0 - $25: M = 0

### Impact labels:

* High impact: (F+M)/2 is between 4 and 5
* Low impact: (F+M)/2 is between 1 and 3

### Engagement labels:

* engaged: R = 5
* slipping: R is 3 - 4
* disengaged: R is 1 - 2
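A minimal sketch of the binning logic above; the thresholds mirror the lists, and the exact boundary behavior (inclusive vs. exclusive) is an assumption that would need confirming:

```
# Sketch of the RFM binning above; boundary handling is an assumption.
from datetime import date

def recency_score(last_donation, today):
    if last_donation is None:  # never given
        return 0
    days = (today - last_donation).days
    if days <= 180:
        return 5
    if days <= 365:
        return 4
    if days <= 728:
        return 3
    if days <= 1093:
        return 2
    return 1

def frequency_score(donations_24mo):
    if donations_24mo >= 24:
        return 5
    if donations_24mo >= 12:
        return 4
    if donations_24mo >= 3:
        return 3
    return min(donations_24mo, 2)  # 2 -> 2, 1 -> 1, 0 -> 0

def monetary_score(total_24mo):
    if total_24mo >= 2001:
        return 5
    if total_24mo >= 501:
        return 4
    if total_24mo >= 250:
        return 3
    if total_24mo >= 101:
        return 2
    if total_24mo >= 26:
        return 1
    return 0

def labels(r, f, m):
    impact = "High impact" if (f + m) / 2 >= 4 else "Low impact"
    # R = 0 (never given) is lumped in with "disengaged" here; an assumption.
    engagement = "engaged" if r == 5 else ("slipping" if r >= 3 else "disengaged")
    return impact, engagement

if __name__ == "__main__":
    print(recency_score(date(2021, 1, 1), date(2021, 3, 1)))  # -> 5
```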

### Can we integrate scoring for fosters/volunteers?

"RFME" (E for Engagement)

* volunteered or fostered in the past 30 days: E = 5
* volunteered or fostered in the past 6 months: E = 4
* volunteered or fostered in the past year: E = 3
* volunteered or fostered in the past 2 years: E = 2
* volunteered or fostered ever: E = 1
* never volunteered or fostered: E = 0

(modified from Lauren's original request of E = 5 (current), E = 4 (within the past year), E = 3 (within the past two years), E = 2 (ever), E = 0 (never), because the "1" value was missing and "current" needed a more specific definition)
55 changes: 55 additions & 0 deletions architecture/user-management-and-authorization.md
@@ -0,0 +1,55 @@
# User management and authorization

### Intro

Because the 360 view gives access to sensitive personal information, we need to ensure that only authorized users can access PDP pages.

### Roles

There are three authorization levels/user roles:

* User: Can use the **Common API** to view 360 data but not make any changes
* Editor: User role plus can use the **Editor API** to manually link existing contacts
* Admin: Editor role plus can use the **Admin API** to upload data and manage users

### Login

Upon login, the user API shall return a JSON Web [Access] Token (JWT) with a limited lifetime[1]. The JWT includes the user's role.

### Authorization

The React client shall render only resources that are authorized for the current user's role. The React client shall present the JWT (using the **Authorization: Bearer** header) to the API server when making a request.\
The API server shall verify that the user represented by the JWT is authorized to access the requested API endpoint, and shall return a 403 status if the user is not authorized to access the endpoint.
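As a rough sketch of the server-side check (the decorator, claim names, and secret handling are illustrative assumptions, not the actual PDP implementation):

```
# Illustrative sketch only: claim names, secret handling, and the decorator
# are assumptions, not the actual PDP code.
from functools import wraps

import jwt  # PyJWT
from flask import jsonify, request

SECRET = "change-me"  # placeholder; a real deployment loads this from config

def role_required(*allowed_roles):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            auth = request.headers.get("Authorization", "")
            if not auth.startswith("Bearer "):
                return jsonify(error="missing token"), 401
            try:
                claims = jwt.decode(
                    auth[len("Bearer "):], SECRET, algorithms=["HS256"]
                )
            except jwt.InvalidTokenError:
                return jsonify(error="invalid or expired token"), 401
            if claims.get("role") not in allowed_roles:
                # Authenticated but not authorized for this endpoint.
                return jsonify(error="forbidden"), 403
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# Usage (hypothetical): decorate an admin endpoint with @role_required("admin")
```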

### Implementation

User roles are stored in the database `pdp_user_roles` table and per-user data is stored in the `pdp_users` table.

### API

**No authorization required**

| Endpoint | Description |
| --------------------- | --------------------------------- |
| `/api/user/test` | Liveness test, always returns 200 |
| `/api/user/test_fail` | Always fails with 401 |
| `/api/user/login` | Login |

**Valid JWT required**

| Endpoint | Description |
| --------------------- | ------------------------------------------- |
| `/api/user/test_auth` | Returns 200 if valid JWT presented |
| `/api/user/logout` | Logout (optional, as client can delete JWT) |

**Admin role required**

| Endpoint | Description |
| -------------------------------- | ------------------------------ |
| `/api/admin/user/create` | Create user |
| `/api/admin/user/get_user_count` | Get count of all users in DB |
| `/api/admin/user/get_users` | Get list of users with details |



[1] _We need to decide on a lifetime that provides an appropriate balance between convenience and security. An expired Access token will require the user to log in again. There is a Refresh-type token that allows automatic renewal of Access tokens without requiring the user to log in, but the power of this kind of token poses additional security concerns._
3 changes: 3 additions & 0 deletions deployment/README.md
@@ -0,0 +1,3 @@
# Deployment

This section contains deployment instructions for the PAWS data pipeline project.
37 changes: 37 additions & 0 deletions deployment/deploying-pdp-within-the-code-for-philly-cluster.md
@@ -0,0 +1,37 @@
# Deploying PDP within the Code for Philly cluster

## PDP hosting

The PAWS Data Pipeline runs on a Kubernetes cluster donated by [Linode](https://www.linode.com) to Code for Philly (CfP) and is managed by the CfP [civic-cloud](https://forum.codeforphilly.org/c/public-development/civic-cloud/17) team.

The code and configurations for the various projects running on the cluster are managed using [hologit](https://github.com/JarvusInnovations/hologit) which

> _lets you declaratively define virtual sub-branches (called holobranches) within any Git branch that mix together content from their host branch, content from other repositories/branches, and executable-driven transformations._ [1]

The pieces for the sandbox clusters can be found in the `.holo` directory in the PDP repository and the [sandbox](https://github.com/CodeForPhilly/cfp-sandbox-cluster) or [live](https://github.com/CodeForPhilly/cfp-live-cluster) cluster repos as appropriate.

The branch (within the PDP repo) that holds the `.holo` directory is specified at [paws-data-pipeline.toml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/.holo/sources/paws-data-pipeline.toml).

RBAC roles and rights are defined at [admins](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/admins/paws-data-pipeline.yaml).

### Updating deployed code

To deploy new code:

* Bump the image tag versions in **paws-data-pipeline/src/helm-chart/values.yaml** to the value you'll use for this deployment (e.g., v2.3.4)
* Commit to master, tag with the above value, and push to GitHub with `--follow-tags`
* Open a PR against [cfp-sandbox-cluster/.holo/sources/paws-data-pipeline.toml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/.holo/sources/paws-data-pipeline.toml) setting `ref = "refs/tags/v2.3.4"`
* The sysadmin folks hang out at [https://forum.codeforphilly.org/c/project-support-center/sysadmin/20](https://forum.codeforphilly.org/c/project-support-center/sysadmin/20), and you can ask for help there

### Ingress controller

CfP uses the [ingress-nginx](https://kubernetes.github.io/ingress-nginx) ingress controller (_not to be confused with an entirely different project called **nginx-ingress**_).

The list of settings can be found here: [Settings](https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/)\
To update settings, edit [release-values.yaml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/paws-data-pipeline/release-values.yaml) and create a pull request.

SSL cert configuration can also be found in [release-values.yaml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/paws-data-pipeline/release-values.yaml).



[1] _“Any sufficiently advanced technology is indistinguishable from magic.”_ Arthur C. Clarke
8 changes: 8 additions & 0 deletions deployment/kubernetes-logs.md
@@ -0,0 +1,8 @@
# Kubernetes logs

Database logs are visible by attaching to `paws-datapipeline-db` and viewing `/var/lib/postgresql/data/log/`.

Since Kubernetes performs liveness tests, there are a lot of probe lines in the logs which you'll want to filter out (a filter sketch follows the list):

* On `paws-datapipeline-server`, filter out lines matching `/api/user/test`
* On `paws-datapipeline-client`, filter out lines matching `GET /`
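A minimal filter sketch, assuming logs are piped in from `kubectl logs` (the deployment name in the comment is illustrative; substitute your own):

```
#!/usr/bin/env python3
# Sketch: drop liveness-probe noise from piped-in pod logs, e.g.
#   kubectl logs deploy/paws-datapipeline-server | python3 filter_logs.py /api/user/test
import sys

def main():
    patterns = sys.argv[1:] or ["/api/user/test"]
    for line in sys.stdin:
        # Keep only lines that match none of the given patterns.
        if not any(p in line for p in patterns):
            sys.stdout.write(line)

if __name__ == "__main__":
    main()
```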
37 changes: 37 additions & 0 deletions deployment/merging-dependabot-prs.md
@@ -0,0 +1,37 @@
# Dependabot PRs
- [Dependabot PRs](#dependabot-prs)
- [Frontend Dependabot PRs](#frontend-dependabot-prs)
- [Backend Dependabot PRs](#backend-dependabot-prs)

## Frontend Dependabot PRs
As the client-facing part of the app is pretty minimal, this process should cover most frontend Dependabot PRs.

- Pull the Dependabot PR
```
gh pr checkout [prNumber]
```

- Rebuild and run the container
```
docker-compose down -v
docker-compose build
docker-compose up
```

- Log in as the `base_admin` user
- Go to `Admin` page
- Upload 2 Volgistics data CSVs in the same upload action
- Click `Run Data Analysis`
- Go to `Users` page
- Create new user
- Update user via `Update User` button
- Change user password via `Change Password` button
- Go to 360 Dataview
- Search for a common name
- Click the user to make sure the page renders correctly
- Log out

If the package patch notes look non-breaking and you encounter no errors in this process, the Dependabot PR should be safe to merge.

## Backend Dependabot PRs
Caution should be exercised when updating backend libraries.
9 changes: 9 additions & 0 deletions deployment/using-github-actions.md
@@ -0,0 +1,9 @@
# Using GitHub actions

To run the CI/CD action:

* Ensure you have `release-containers.yml` in `/paws-data-pipeline/.github/workflows`
* Tag your code: `git tag -fa v1.4 -m "Still testing Actions"`
* Push with `git push -f --tags`

Check the [Actions](https://github.com/CodeForPhilly/paws-data-pipeline/actions) page to see the progress.
3 changes: 3 additions & 0 deletions setup/README.md
@@ -0,0 +1,3 @@
# Setup

This section contains setup instructions for the PAWS data pipeline project.