
Add dependabot PR docs #613

Closed
wants to merge 16 commits
Binary file added .gitbook/assets/image (1).png
Binary file added .gitbook/assets/image (2).png
Binary file added .gitbook/assets/image (3).png
Binary file added .gitbook/assets/image (4).png
Binary file added .gitbook/assets/image.png
19 changes: 0 additions & 19 deletions GettingStarted.md

This file was deleted.

45 changes: 4 additions & 41 deletions README.md
@@ -1,44 +1,7 @@
# [The Philadelphia Animal Welfare Society (PAWS)](phillypaws.org)
# Developer Guide for PAWS data pipeline

As the city's largest animal rescue partner and no-kill animal shelter,
the [Philadelphia Animal Welfare Society (PAWS)](phillypaws.org) is working to make Philadelphia
a place where every healthy and treatable pet is guaranteed a home. Since inception over 10 years ago,
PAWS has rescued and placed 27,000+ animals in adoptive and foster homes, and has worked to prevent pet homelessness
by providing 86,000+ low-cost spay/neuter services and affordable vet care to 227,000+
clinic patients. PAWS is funded 100% through donations, with 91 cents of every dollar collected going
directly to the animals. Therefore, PAWS' rescue work (including 3 shelters and all rescue and
animal care programs), administration and development efforts are coordinated by only about
70 staff members complemented by over 1500 volunteers.
This is the PAWS data pipeline documentation for the developers and administrators working on the [PAWS PDP project](https://github.com/CodeForPhilly/paws-data-pipeline) or running instances of it. PAWS staff and other end users, please use the [end user documentation](https://paws-data-pipeline.gitbook.io/user/).

## [The Data Pipeline](https://codeforphilly.org/projects/paws_data_pipeline)
It is maintained in the GitBook format within the docs/ directory of the project's main Git repository, and new versions are published automatically upon commits or merges to the `documentation-dev` branch in [the project's GitHub repo](https://github.com/CodeForPhilly/paws-data-pipeline).

Through all of its operational and service activities, PAWS accumulates data regarding donations,
adoptions, fosters, volunteers, merchandise sales, event attendees (to name a few),
each in their own system and/or manual tally. This vital data that can
drive insights remains siloed and is usually difficult to extract, manipulate, and analyze.

This project provides PAWS with an easy-to-use and easy-to-support tool to extract
constituent data from multiple source systems, standardize extracted data, match constituents across data sources,
load relevant data into Salesforce, and run an automation in Salesforce to produce an RFM score.
Through these processes, the PAWS data pipeline has laid the groundwork for facilitating an up-to-date 360-degree view of PAWS constituents, and
flexible ongoing data analysis and insights discovery.

## Uses

- The pipeline can inform the PAWS development team of new constituents through volunteer or foster engagement
- Instead of manually matching constituents from volunteering, donations and foster/adoptions, PAWS staff only need to upload the volunteer dataset into the pipeline, and the pipeline handles the matching
- Volunteer and Foster data are automatically loaded into the constituent's SalesForce profile
- An RFM score is calculated for each constituent using the most recent data
- Data analyses can use the output of the PDP matching logic to join datasets from different sources; PAWS can benefit from such analyses in the following ways:
- PAWS operations can be better informed and use data-driven decisions to guide programs and maximize effectiveness;
- Supporters can be further engaged by suggesting additional opportunities for involvement based upon pattern analysis;
- Multi-dimensional supporters can be consistently (and accurately) acknowledged for all the ways they support PAWS (i.e. a volunteer who donates and also fosters kittens), not to mention opportunities to further tap the potential of these enthusiastic supporters.

## [Code of Conduct](https://codeforphilly.org/pages/code_of_conduct)

This is a Code for Philly project operating under their code of conduct.

## Links

[Slack Channel](https://codeforphilly.org/chat?channel=paws_data_pipeline)
[Wiki](https://github.com/CodeForPhilly/paws-data-pipeline/wiki)
To contribute to this book, please commit directly or open a pull request against the `documentation-dev` branch of github.com/CodeForPhilly/paws-data-pipeline.
27 changes: 27 additions & 0 deletions SUMMARY.md
@@ -0,0 +1,27 @@
# Table of contents

* [Overview](README.md)
* [Setup](setup/README.md)
* [Getting Started](setup/getting-started.md)
* [Local Setup](setup/local-setup.md)
* [Accessing APIs without React](setup/accessing-apis-without-react.md)
* [Architecture](architecture/README.md)
* [User management and authorization](architecture/user-management-and-authorization.md)
* [Async on the cheap (for MVP)](architecture/async-on-the-cheap-for-mvp.md)
* [Execution status stages](architecture/execution-status-stages.md)
* [Data Flow](architecture/data-flow.md)
* [Database Schema](architecture/database-schema.md)
* [Operations](deployment/README.md)
* [Using GitHub actions](deployment/using-github-actions.md)
* [Deploying PDP within the Code for Philly cluster](deployment/deploying-pdp-within-the-code-for-philly-cluster.md)
* [Kubernetes Setup](setup/kubernetes-setup.md)
* [Kubernetes logs](deployment/kubernetes-logs.md)
* [Merging Dependabot PRs](deployment/merging-dependabot-prs.md)
* [Troubleshooting](troubleshooting/README.md)
* [Common Errors](troubleshooting/common-errors.md)
* [Dups Problem](troubleshooting/dups-problem.md)
* [Code of Conduct](https://codeforphilly.org/pages/code_of_conduct)
* [Contributors](https://test.pawsdp.org/about)
* Glossary
* Archives
* [RFM](architecture/rfm.md)
3 changes: 3 additions & 0 deletions architecture/README.md
@@ -0,0 +1,3 @@
# Architecture

This section contains information on the architecture used for the PAWS data pipeline project.
34 changes: 34 additions & 0 deletions architecture/async-on-the-cheap-for-mvp.md
@@ -0,0 +1,34 @@
# Async on the cheap (for MVP)

### Introduction

It's recognized [1, 2] that the best way to handle long-running tasks is to use a task queue, allowing separation of the middle layer (API server) and the execution server. But as we're trying to get an MVP out for feedback, it's not unreasonable to use a less-than-perfect solution in the interim. Here are a few ideas for discussion:

### _Continue to treat execute() as synchronous but stream back status information_

We've been operating (at the API server) with a model of _receive request, do work, return() with data_. But both Flask and JS support streaming data in chunks from server to client:\
Flask: [Streaming Contents](https://flask.palletsprojects.com/en/1.1.x/patterns/streaming/)\
JS: [Using readable streams](https://developer.mozilla.org/en-US/docs/Web/API/Streams_API/Using_readable_streams)\
\
From the Flask side, the data it streams back would be status updates (_e.g._, every 100 rows processed) which the React client would use to update the display. When the server sends back "complete", React displays a nice completion message and the user proceeds to the 360 view.
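A minimal sketch of the server side of this idea, assuming a hypothetical `process_rows()` generator standing in for the real matching work (the route name and chunk size are illustrative, not the actual pipeline code):

```
# Sketch only: process_rows() and the route are illustrative assumptions.
from flask import Flask, Response

app = Flask(__name__)

def process_rows():
    # Stand-in for the real work; yield a status line every 100 rows
    # so the React client can update its display as chunks arrive.
    total = 1000
    for done in range(100, total + 1, 100):
        # ... process a chunk of rows here ...
        yield f"processed {done}/{total}\n"
    yield "complete\n"

@app.route("/api/execute")
def execute():
    # Flask streams the generator's output back to the client in chunks.
    return Response(process_rows(), mimetype="text/plain")
```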

#### **Evaluation**

Doesn't appear to require much heavy lifting at server or client (we would need to figure out how to feed the generator on the server), but it may be a bit brittle: if there's any kind of network hiccup (or the user reloads the page), the stream would be broken and we wouldn't be able to tell the user anything useful.

### _Client aborts Fetch, polls status API until completion_

In this idea, instead of waiting for the execute() Fetch to complete, the React client uses an [AbortController](https://developer.mozilla.org/en-US/docs/Web/API/AbortController/abort) to cancel the pending Fetch. It then starts polling the API execution status endpoint, displaying updates until that endpoint reports that the operation is complete.

**Evaluation**

Using SQLAlchemy's `engine.dispose()` and two uWSGI processes, I've got `/api/get_execution_status/<job_id>` working correctly. I'd probably want to have it find the latest job instead of having to specify one (although we could use the streaming model above to send back the `job_id`).

![](https://user-images.githubusercontent.com/11001850/112061042-4ceb9580-8b34-11eb-8dc7-fb9eede44d7d.png)

We need to figure out what side effects there might be to cancelling the fetch. I presume the browser would drop the connection; will Flask assume it can kill the request?\
The client could check status when the page loads to see if there's a running job, so it would be more robust in the face of network issues or reloads.
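For illustration, a minimal sketch of what the polled status endpoint might look like (the table and column names here are assumptions, not the actual schema):

```
# Sketch only: table/column names and the DSN are assumptions.
from flask import Flask, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)
engine = create_engine("postgresql://user:pass@localhost/paws")  # placeholder DSN

@app.route("/api/get_execution_status/<int:job_id>")
def get_execution_status(job_id):
    with engine.connect() as conn:
        row = conn.execute(
            text("SELECT stage, details FROM execution_status WHERE job_id = :id"),
            {"id": job_id},
        ).first()
    if row is None:
        return jsonify(error="no such job"), 404
    return jsonify(job_id=job_id, stage=row.stage, details=row.details)
```

Finding the latest job instead would just replace the `WHERE` clause with `ORDER BY job_id DESC LIMIT 1`.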

[1] [https://flask.palletsprojects.com/en/1.1.x/patterns/celery/](https://flask.palletsprojects.com/en/1.1.x/patterns/celery/)\
[2] [https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-xxii-background-jobs](https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-xxii-background-jobs)

7 changes: 7 additions & 0 deletions architecture/data-flow.md
@@ -0,0 +1,7 @@
# Data Flow

![](<../.gitbook/assets/image (2).png>)

[flow chart](https://app.lucidchart.com/invitations/accept/0602fccf-18f9-48d4-84ff-ffe5f0b03e7a)

**ShelterLuv People**: This data is pulled by a script that calls the ShelterLuv API and saves the data as a CSV into a Dropbox folder via a Dropbox "app". It is set up to run from a config file plus a cron job, although this is not yet active in deployment. Every run pulls everything, because the API doesn't support pagination. To configure the automation, the config file needs to contain the app ID.
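For reference, a pull script along these lines might look like the following sketch; the endpoint path, header name, and config keys are illustrative assumptions rather than the actual script's values:

```
# Hypothetical sketch of the ShelterLuv pull described above; the endpoint,
# header, and config keys are assumptions, not the deployed script.
import csv
import json

import requests

def pull_shelterluv_people(api_key, out_path):
    # No pagination support, so every run fetches the entire dataset.
    resp = requests.get(
        "https://www.shelterluv.com/api/v1/people",
        headers={"X-Api-Key": api_key},
        timeout=60,
    )
    resp.raise_for_status()
    people = resp.json().get("people", [])
    if not people:
        return
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(people[0].keys()))
        writer.writeheader()
        writer.writerows(people)

if __name__ == "__main__":
    with open("config.json") as f:  # assumed to hold the app ID / API key
        cfg = json.load(f)
    pull_shelterluv_people(cfg["shelterluv_api_key"], "shelterluv_people.csv")
```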
8 changes: 8 additions & 0 deletions architecture/database-schema.md
@@ -0,0 +1,8 @@
# Database Schema

TODO: fix link

[https://app.diagrams.net/#G1X4KbjYf7vcrfbeJLfyCj8xUPp8zGcV2k](https://app.diagrams.net/#G1X4KbjYf7vcrfbeJLfyCj8xUPp8zGcV2k)


5 changes: 5 additions & 0 deletions architecture/execution-status-stages.md
@@ -0,0 +1,5 @@
# Execution status stages

The `execution_status` table will be updated for a given `job_id` through the stages in the diagram below.

![](../.gitbook/assets/image.png)
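As a rough illustration of how a job might advance through those stages (the stage names and columns here are assumptions read off the diagram, not the exact schema):

```
# Illustrative sketch only: stage names and columns are assumptions.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/paws")  # placeholder DSN

STAGES = ("initiated", "processing", "matching", "complete")  # assumed names

def set_execution_status(job_id, stage, details=""):
    # Each phase of a job advances the row for its job_id to the next stage.
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    with engine.begin() as conn:
        conn.execute(
            text(
                "UPDATE execution_status "
                "SET stage = :stage, details = :details "
                "WHERE job_id = :job_id"
            ),
            {"stage": stage, "details": details, "job_id": job_id},
        )
```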
68 changes: 68 additions & 0 deletions architecture/rfm.md
@@ -0,0 +1,68 @@
# RFM

## RFM Data Flows

![](<../.gitbook/assets/image (3).png>)

## RFM Database Tables

![](<../.gitbook/assets/image (4).png>)

## RFME Bin Logic

### Recency:

If a person's last donation was:

* in the last 180 days: R = 5
* 180 - 365 days ago: R = 4
* 365 - 728 days ago: R = 3
* 728 - 1093 days ago: R = 2
* more than 1093 days ago: R = 1
* never given: R = 0

### Frequency:

If in the last 24 months someone has made a total of

* 24 or more donations: F = 5
* 12 - 23 donations: F = 4
* 3 - 11 donations: F = 3
* 2 donations: F = 2
* 1 donation: F = 1
* 0 donations: F = 0

### Monetary value:

If someone's cumulative giving in the past 24 months is

* $2001 or more: M = 5
* $501 - $2000: M = 4
* $250 - $500: M = 3
* $101 - $249: M = 2
* $26 - $100: M = 1
* $0 - $25: M = 0

### Impact labels:

* High impact: (F+M)/2 is between 4 and 5
* Low impact: (F+M)/2 is between 1 and 3

### Engagement labels:

* engaged: R = 5
* slipping: R is 3 - 4
* disengaged: R is 1 - 2
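A minimal sketch of the binning logic above; the thresholds mirror the lists, and the exact boundary behavior (inclusive vs. exclusive) is an assumption that would need confirming:

```
# Sketch of the RFM binning above; boundary handling is an assumption.
from datetime import date

def recency_score(last_donation, today):
    if last_donation is None:  # never given
        return 0
    days = (today - last_donation).days
    if days <= 180:
        return 5
    if days <= 365:
        return 4
    if days <= 728:
        return 3
    if days <= 1093:
        return 2
    return 1

def frequency_score(donations_24mo):
    if donations_24mo >= 24:
        return 5
    if donations_24mo >= 12:
        return 4
    if donations_24mo >= 3:
        return 3
    return min(donations_24mo, 2)  # 2 -> 2, 1 -> 1, 0 -> 0

def monetary_score(total_24mo):
    if total_24mo >= 2001:
        return 5
    if total_24mo >= 501:
        return 4
    if total_24mo >= 250:
        return 3
    if total_24mo >= 101:
        return 2
    if total_24mo >= 26:
        return 1
    return 0

def labels(r, f, m):
    impact = "High impact" if (f + m) / 2 >= 4 else "Low impact"
    # R = 0 (never given) is lumped in with "disengaged" here; an assumption.
    engagement = "engaged" if r == 5 else ("slipping" if r >= 3 else "disengaged")
    return impact, engagement

if __name__ == "__main__":
    print(recency_score(date(2021, 1, 1), date(2021, 3, 1)))  # -> 5
```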

### Can we integrate scoring for fosters/volunteers?

"RFME" (E for Engagement)

* volunteered or fostered in the past 30 days: E = 5
* volunteered or fostered in the past 6 months: E = 4
* volunteered or fostered in the past year: E = 3
* volunteered or fostered in the past 2 years: E = 2
* volunteered or fostered ever: E = 1
* never volunteered or fostered: E = 0

(modified from Lauren's original request of E = 5 (current), E = 4 (within the past year), E = 3 (within the past two years), E = 2 (ever), E = 0 (never), because the "1" value was missing and "current" needed a more specific definition)
55 changes: 55 additions & 0 deletions architecture/user-management-and-authorization.md
@@ -0,0 +1,55 @@
# User management and authorization

### Intro

Because the 360 view gives access to sensitive personal information, we need to ensure that only authorized users can access PDP pages.

### Roles

There are three authorization levels/user roles:

* User: Can use the **Common API** to view 360 data but not make any changes
* Editor: User role plus can use the **Editor API** to manually link existing contacts
* Admin: Editor role plus can use the **Admin API** to upload data and manage users

### Login

Upon login, the user API shall return a JSON Web [Access] Token (JWT) with a limited lifetime[1]. The JWT includes the user's role.

### Authorization

The React client shall render only resources that are authorized for the current user's role. The React client shall present the JWT (using the **Authorization: Bearer** header) to the API server when making a request.\
The API server shall verify that the user represented by the JWT is authorized to access the requested API endpoint, and shall return a 403 status if the user is not authorized to access the endpoint.
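As a rough sketch of the server-side check (the decorator, claim names, and secret handling are illustrative assumptions, not the actual PDP implementation):

```
# Illustrative sketch only: claim names, secret handling, and the decorator
# are assumptions, not the actual PDP code.
from functools import wraps

import jwt  # PyJWT
from flask import jsonify, request

SECRET = "change-me"  # placeholder; a real deployment loads this from config

def role_required(*allowed_roles):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            auth = request.headers.get("Authorization", "")
            if not auth.startswith("Bearer "):
                return jsonify(error="missing token"), 401
            try:
                claims = jwt.decode(
                    auth[len("Bearer "):], SECRET, algorithms=["HS256"]
                )
            except jwt.InvalidTokenError:
                return jsonify(error="invalid or expired token"), 401
            if claims.get("role") not in allowed_roles:
                # Authenticated but not authorized for this endpoint.
                return jsonify(error="forbidden"), 403
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# Usage (hypothetical): decorate an admin endpoint with @role_required("admin")
```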

### Implementation

User roles are stored in the database `pdp_user_roles` table and per-user data is stored in the `pdp_users` table.

### API

**No authorization required**

| Endpoint | Description |
| --------------------- | --------------------------------- |
| `/api/user/test` | Liveness test, always returns 200 |
| `/api/user/test_fail` | Always fails with 401 |
| `/api/user/login` | Login |

**Valid JWT required**

| Endpoint | Description |
| --------------------- | ------------------------------------------- |
| `/api/user/test_auth` | Returns 200 if valid JWT presented |
| `/api/user/logout` | Logout (optional, as client can delete JWT) |

**Admin role required**

| Endpoint | Description |
| -------------------------------- | ------------------------------ |
| `/api/admin/user/create` | Create user |
| `/api/admin/user/get_user_count` | Get count of all users in DB |
| `/api/admin/user/get_users` | Get list of users with details |



[1] _We need to decide on a lifetime that provides an appropriate balance between convenience and security. An expired Access token will require the user to log in again. There is a Refresh-type token that allows automatic renewal of Access tokens without requiring the user to log in, but the power of this kind of token poses additional security concerns._
3 changes: 3 additions & 0 deletions deployment/README.md
@@ -0,0 +1,3 @@
# Deployment

This section contains deployment instructions for the PAWS data pipeline project.
37 changes: 37 additions & 0 deletions deployment/deploying-pdp-within-the-code-for-philly-cluster.md
@@ -0,0 +1,37 @@
# Deploying PDP within the Code for Philly cluster

## PDP hosting

The PAWS Data Pipeline runs on a Kubernetes cluster donated by [Linode](https://www.linode.com) to Code for Philly (CfP) and is managed by the CfP [civic-cloud](https://forum.codeforphilly.org/c/public-development/civic-cloud/17) team.

The code and configurations for the various projects running on the cluster are managed using [hologit](https://github.com/JarvusInnovations/hologit) which

> _lets you declaratively define virtual sub-branches (called holobranches) within any Git branch that mix together content from their host branch, content from other repositories/branches, and executable-driven transformations._ [1]

The pieces for the sandbox clusters can be found in the `.holo` directory in the PDP repository and the [sandbox](https://github.com/CodeForPhilly/cfp-sandbox-cluster) or [live](https://github.com/CodeForPhilly/cfp-live-cluster) cluster repos as appropriate.

The branch (within the PDP repo) that holds the `.holo` directory is specified at [paws-data-pipeline.toml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/.holo/sources/paws-data-pipeline.toml).

RBAC roles and rights are defined at [admins](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/admins/paws-data-pipeline.yaml).

### Updating deployed code

To deploy new code:

* Bump the image tag versions in **paws-data-pipeline/src/helm-chart/values.yaml** to the value you'll use for this deployment (e.g., v2.3.4)
* Commit to master, tag with the above value, and push to GitHub with `--follow-tags`
* Open a PR against [cfp-sandbox-cluster/.holo/sources/paws-data-pipeline.toml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/.holo/sources/paws-data-pipeline.toml) setting `ref = "refs/tags/v2.3.4"`
* The sysadmin folks hang out at [https://forum.codeforphilly.org/c/project-support-center/sysadmin/20](https://forum.codeforphilly.org/c/project-support-center/sysadmin/20), and you can ask for help there

### Ingress controller

CfP uses the [ingress-nginx](https://kubernetes.github.io/ingress-nginx) ingress controller (_not to be confused with an entirely different project called **nginx-ingress**_).

The list of settings can be found here: [Settings](https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/)\
To update settings, edit [release-values.yaml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/paws-data-pipeline/release-values.yaml) and create a pull request.

SSL cert configuration can also be found in [release-values.yaml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/paws-data-pipeline/release-values.yaml).



[1] _“Any sufficiently advanced technology is indistinguishable from magic.”_ Arthur C. Clarke
8 changes: 8 additions & 0 deletions deployment/kubernetes-logs.md
@@ -0,0 +1,8 @@
# Kubernetes logs

Database logs are visible by attaching to `paws-datapipeline-db` and viewing `/var/lib/postgresql/data/log/`.

Since Kubernetes performs liveness tests, there are a lot of probe lines in the logs which you'll want to filter out (a filter sketch follows the list):

* On `paws-datapipeline-server`, filter out lines matching `/api/user/test`
* On `paws-datapipeline-client`, filter out lines matching `GET /`
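A minimal filter sketch, assuming logs are piped in from `kubectl logs` (the deployment name in the comment is illustrative; substitute your own):

```
#!/usr/bin/env python3
# Sketch: drop liveness-probe noise from piped-in pod logs, e.g.
#   kubectl logs deploy/paws-datapipeline-server | python3 filter_logs.py /api/user/test
import sys

def main():
    patterns = sys.argv[1:] or ["/api/user/test"]
    for line in sys.stdin:
        # Keep only lines that match none of the given patterns.
        if not any(p in line for p in patterns):
            sys.stdout.write(line)

if __name__ == "__main__":
    main()
```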
37 changes: 37 additions & 0 deletions deployment/merging-dependabot-prs.md
@@ -0,0 +1,37 @@
# Dependabot PRs
- [Dependabot PRs](#dependabot-prs)
- [Frontend Dependabot PRs](#frontend-dependabot-prs)
- [Backend Dependabot PRs](#backend-dependabot-prs)

## Frontend Dependabot PRs
As the client-facing part of the app is pretty minimal, this process should cover most frontend Dependabot PRs.

- Pull the Dependabot PR
```
gh pr checkout [prNumber]
```

- Rebuild and run the container
```
docker-compose down -v
docker-compose build
docker-compose up
```

- Log in as the `base_admin` user
- Go to `Admin` page
- Upload 2 Volgistics data CSVs in the same upload action
- Click `Run Data Analysis`
- Go to `Users` page
- Create new user
- Update user via `Update User` button
- Change user password via `Change Password` button
- Go to 360 Dataview
- Search for a common name
- Click the user to make sure the page renders correctly
- Log out

If the package patch notes look non-breaking and you encounter no errors in this process, the Dependabot PR should be safe to merge.

## Backend Dependabot PRs
Caution should be exercised when updating backend libraries.
9 changes: 9 additions & 0 deletions deployment/using-github-actions.md
@@ -0,0 +1,9 @@
# Using GitHub actions

To run the CI/CD action:

* Ensure you have `release-containers.yml` in `/paws-data-pipeline/.github/workflows`
* Tag your code: `git tag -fa v1.4 -m "Still testing Actions"`
* Push with `git push -f --tags`

Check the [Actions](https://github.com/CodeForPhilly/paws-data-pipeline/actions) page to see the progress.
3 changes: 3 additions & 0 deletions setup/README.md
@@ -0,0 +1,3 @@
# Setup

This section contains setup instructions for the PAWS data pipeline project.