Get Started

1. Go to GitHub and follow the instructions to set up a local Yinyo instance

2. Run your first scraper

yinyo client test/scrapers/test-python --output data.sqlite

This will stream the console output of the scraper straight to you

-----> Python app detected
       !     Python has released a security update! Please consider upgrading to python-2.7.16
       Learn More: https://devcenter.heroku.com/articles/python-runtimes
-----> Installing requirements with pip
       Obtaining scraperwiki from git+http://github.com/openaustralia/scraperwiki-python.git@morph_defaults#egg=scraperwiki (from -r /tmp/build/requirements.txt (line 2))
       Cloning http://github.com/openaustralia/scraperwiki-python.git (to morph_defaults) to /app/.heroku/src/scraperwiki
       Installing collected packages: scraperwiki
       Running setup.py develop for scraperwiki
       Successfully installed scraperwiki

-----> Discovering process types
       Procfile declares types -> scraper
First a little test message to stderr
Hello from test-python!
1...
2...
3...
4...
5...

3. Do it all again! But this time using the API directly, step by step

1. Create a run

curl -X POST http://localhost:8080/runs

You'll get a name and a token back, which you'll need in the following steps

{ "name": "run-qjv4t", "token": "lLsBCZiBPYcTQb439YvPbz9GC3bPcYr5" }

So, to save a bit of typing in the next few steps, let's set a couple of environment variables

NAME=run-qjv4t
TOKEN=lLsBCZiBPYcTQb439YvPbz9GC3bPcYr5

(Replace the run name and token with your own values)
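
If you have jq installed, a convenient alternative is to create the run and capture the name and token in one go. This is just a shortcut for the steps above (note that it creates a fresh run, so use it instead of the curl command above, not as well as):

RESPONSE=$(curl -s -X POST http://localhost:8080/runs)
NAME=$(echo "$RESPONSE" | jq -r .name)
TOKEN=$(echo "$RESPONSE" | jq -r .token)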

2. Tar and compress the code

tar -C test/scrapers/test-python/ -zcf code.tgz .
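
If you want to double-check what went into the archive before uploading it, tar can list its contents:

tar -tzf code.tgz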

3. Upload the code

curl -X PUT -H "Authorization: Bearer $TOKEN" "http://localhost:8080/runs/$NAME/app" --data-binary @code.tgz
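
As a small shortcut, the tar and upload steps can also be combined into a single pipeline; curl reads the request body from stdin when you pass --data-binary @-. A sketch:

tar -C test/scrapers/test-python/ -zcf - . | curl -X PUT -H "Authorization: Bearer $TOKEN" "http://localhost:8080/runs/$NAME/app" --data-binary @-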

4. Start the run

Note that we're also passing the path to the file that we want to get at the end of the run.

curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" "http://localhost:8080/runs/$NAME/start" -d '{"output":"data.sqlite"}'

5. Stream the events

curl -H "Authorization: Bearer $TOKEN" "http://localhost:8080/runs/$NAME/events"

This will output a stream of events, formatted as JSON, in real time

{"id":"1580183421503-0","time":"2020-01-28T03:50:21.472365664Z","type":"start","data":{"stage":"build"}}
{"id":"1580183426973-0","time":"2020-01-28T03:50:26.925073211Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       \u001b[1G-----\u003e Python app detected"}}
{"id":"1580183431263-0","time":"2020-01-28T03:50:31.247105651Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       !     Python has released a security update! Please consider upgrading to python-2.7.16"}}
{"id":"1580183431275-0","time":"2020-01-28T03:50:31.264161902Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Learn More: https://devcenter.heroku.com/articles/python-runtimes"}}
{"id":"1580183431278-0","time":"2020-01-28T03:50:31.276357179Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G-----\u003e Installing python-2.7.15"}}
{"id":"1580183574302-0","time":"2020-01-28T03:52:54.291338196Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G-----\u003e Installing pip"}}
{"id":"1580183584476-0","time":"2020-01-28T03:53:04.474015597Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G-----\u003e Installing SQLite3"}}
{"id":"1580183619959-0","time":"2020-01-28T03:53:39.869423145Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G-----\u003e Installing requirements with pip"}}
{"id":"1580183620457-0","time":"2020-01-28T03:53:40.45369831Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Obtaining scraperwiki from git+http://github.com/openaustralia/scraperwiki-python.git@morph_defaults#egg=scraperwiki (from -r /tmp/build/requirements.txt (line 2))"}}
{"id":"1580183620511-0","time":"2020-01-28T03:53:40.462302015Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Cloning http://github.com/openaustralia/scraperwiki-python.git (to morph_defaults) to /app/.heroku/src/scraperwiki"}}
{"id":"1580183623775-0","time":"2020-01-28T03:53:43.77187282Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Collecting dumptruck\u003e=0.1.2 (from scraperwiki-\u003e-r /tmp/build/requirements.txt (line 2))"}}
{"id":"1580183624610-0","time":"2020-01-28T03:53:44.607800268Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Downloading https://files.pythonhosted.org/packages/15/27/3330a343de80d6849545b6c7723f8c9a08b4b104de964ac366e7e6b318df/dumptruck-0.1.6.tar.gz"}}
{"id":"1580183624915-0","time":"2020-01-28T03:53:44.90619589Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Collecting requests (from scraperwiki-\u003e-r /tmp/build/requirements.txt (line 2))"}}
{"id":"1580183625316-0","time":"2020-01-28T03:53:45.285716896Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)"}}
{"id":"1580183625486-0","time":"2020-01-28T03:53:45.476952262Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Collecting urllib3!=1.25.0,!=1.25.1,\u003c1.26,\u003e=1.21.1 (from requests-\u003escraperwiki-\u003e-r /tmp/build/requirements.txt (line 2))"}}
{"id":"1580183625818-0","time":"2020-01-28T03:53:45.816865328Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Downloading https://files.pythonhosted.org/packages/e8/74/6e4f91745020f967d09332bb2b8b9b10090957334692eb88ea4afe91b77f/urllib3-1.25.8-py2.py3-none-any.whl (125kB)"}}
{"id":"1580183626043-0","time":"2020-01-28T03:53:46.036071959Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Collecting certifi\u003e=2017.4.17 (from requests-\u003escraperwiki-\u003e-r /tmp/build/requirements.txt (line 2))"}}
{"id":"1580183626355-0","time":"2020-01-28T03:53:46.349999527Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Downloading https://files.pythonhosted.org/packages/b9/63/df50cac98ea0d5b006c55a399c3bf1db9da7b5a24de7890bc9cfd5dd9e99/certifi-2019.11.28-py2.py3-none-any.whl (156kB)"}}
{"id":"1580183626572-0","time":"2020-01-28T03:53:46.566282329Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Collecting chardet\u003c3.1.0,\u003e=3.0.2 (from requests-\u003escraperwiki-\u003e-r /tmp/build/requirements.txt (line 2))"}}
{"id":"1580183626881-0","time":"2020-01-28T03:53:46.865673118Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)"}}
{"id":"1580183627068-0","time":"2020-01-28T03:53:47.064912201Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Collecting idna\u003c2.9,\u003e=2.5 (from requests-\u003escraperwiki-\u003e-r /tmp/build/requirements.txt (line 2))"}}
{"id":"1580183627365-0","time":"2020-01-28T03:53:47.359353914Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)"}}
{"id":"1580183627480-0","time":"2020-01-28T03:53:47.459853005Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Installing collected packages: dumptruck, urllib3, certifi, chardet, idna, requests, scraperwiki"}}
{"id":"1580183627487-0","time":"2020-01-28T03:53:47.481636522Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Running setup.py install for dumptruck: started"}}
{"id":"1580183627810-0","time":"2020-01-28T03:53:47.759041216Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Running setup.py install for dumptruck: finished with status 'done'"}}
{"id":"1580183628160-0","time":"2020-01-28T03:53:48.156023412Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Running setup.py develop for scraperwiki"}}
{"id":"1580183628416-0","time":"2020-01-28T03:53:48.412788154Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Successfully installed certifi-2019.11.28 chardet-3.0.4 dumptruck-0.1.6 idna-2.8 requests-2.22.0 scraperwiki urllib3-1.25.8"}}
{"id":"1580183628838-0","time":"2020-01-28T03:53:48.830730376Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       "}}
{"id":"1580183629805-0","time":"2020-01-28T03:53:49.771548419Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       \u001b[1G-----\u003e Discovering process types"}}
{"id":"1580183629813-0","time":"2020-01-28T03:53:49.80782406Z","type":"log","data":{"stage":"build","stream":"stdout","text":"\u001b[1G       Procfile declares types -\u003e scraper"}}
{"id":"1580183629853-0","time":"2020-01-28T03:53:49.832331568Z","type":"finish","data":{"stage":"build","exit_data":{"exit_code":0,"usage":{"wall_time":208.322458275,"cpu_time":23.252974000000002,"max_rss":77045760,"network_in":50109438,"network_out":1307044}}}}
{"id":"1580183638633-0","time":"2020-01-28T03:53:58.625907685Z","type":"start","data":{"stage":"run"}}
{"id":"1580183641377-0","time":"2020-01-28T03:54:01.368828538Z","type":"log","data":{"stage":"run","stream":"stdout","text":"Hello from test-python!"}}
{"id":"1580183641386-0","time":"2020-01-28T03:54:01.380169548Z","type":"log","data":{"stage":"run","stream":"stdout","text":"1..."}}
{"id":"1580183641411-0","time":"2020-01-28T03:54:01.398367334Z","type":"log","data":{"stage":"run","stream":"stderr","text":"First a little test message to stderr"}}
{"id":"1580183642374-0","time":"2020-01-28T03:54:02.368949967Z","type":"log","data":{"stage":"run","stream":"stdout","text":"2..."}}
{"id":"1580183643372-0","time":"2020-01-28T03:54:03.370619368Z","type":"log","data":{"stage":"run","stream":"stdout","text":"3..."}}
{"id":"1580183644373-0","time":"2020-01-28T03:54:04.371837081Z","type":"log","data":{"stage":"run","stream":"stdout","text":"4..."}}
{"id":"1580183645376-0","time":"2020-01-28T03:54:05.37319053Z","type":"log","data":{"stage":"run","stream":"stdout","text":"5..."}}
{"id":"1580183646397-0","time":"2020-01-28T03:54:06.38666964Z","type":"finish","data":{"stage":"run","exit_data":{"exit_code":0,"usage":{"wall_time":7.75252631,"cpu_time":0.384125,"max_rss":136421376,"network_in":28585,"network_out":7125}}}}
{"id":"1580183648184-0","time":"2020-01-28T03:54:08.182780901Z","type":"last","data":{}}

You might notice that this takes longer than when we ran it with the yinyo client. It has to install Python and some dependencies, which takes a while. That's because we've ignored caching here just to keep things a bit simpler.
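
The raw event stream can be a bit noisy to read. If you have jq installed, one way to follow just the log text as it arrives is to filter on the event type (the -N and --unbuffered flags stop curl and jq from buffering the stream):

curl -sN -H "Authorization: Bearer $TOKEN" "http://localhost:8080/runs/$NAME/events" | jq --unbuffered -r 'select(.type == "log") | .data.text'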

6. Get the output

Now get the output file that we chose when we started the run and save it to a local file called data.sqlite.

curl -H "Authorization: Bearer $TOKEN" "http://localhost:8080/runs/$NAME/output" --output data.sqlite

7. Clean up

curl -X DELETE -H "Authorization: Bearer $TOKEN" "http://localhost:8080/runs/$NAME"
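
Putting it all together, the whole API flow fits in a short shell script. This is just a sketch of the steps above; it assumes jq is installed and that you run it from the Yinyo source directory so the test scraper path resolves:

#!/bin/sh
set -e

# Create a run and capture its name and token
RESPONSE=$(curl -s -X POST http://localhost:8080/runs)
NAME=$(echo "$RESPONSE" | jq -r .name)
TOKEN=$(echo "$RESPONSE" | jq -r .token)

# Package up the scraper code and upload it
tar -C test/scrapers/test-python/ -zcf code.tgz .
curl -s -X PUT -H "Authorization: Bearer $TOKEN" "http://localhost:8080/runs/$NAME/app" --data-binary @code.tgz

# Start the run, asking for data.sqlite back at the end
curl -s -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" "http://localhost:8080/runs/$NAME/start" -d '{"output":"data.sqlite"}'

# Follow the events until the stream ends (it finishes with a "last" event)
curl -sN -H "Authorization: Bearer $TOKEN" "http://localhost:8080/runs/$NAME/events"

# Fetch the output file and clean up the run
curl -s -H "Authorization: Bearer $TOKEN" "http://localhost:8080/runs/$NAME/output" --output data.sqlite
curl -s -X DELETE -H "Authorization: Bearer $TOKEN" "http://localhost:8080/runs/$NAME"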

4. Check out the API reference to see what more you can do