-
Notifications
You must be signed in to change notification settings - Fork 38
/
captions.sbv
1203 lines (883 loc) · 30.4 KB
/
captions.sbv
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
0:00:03.900,0:00:08.460
good morning everybody David Shapiro
here we are back for part two of the
0:00:08.460,0:00:15.780
um Knowledge Graph Supreme Court decision thing
so we're using chat GPT and gpt3 to help us along
0:00:16.680,0:00:25.980
and as a quick recap where we left off was um
I didn't get that far um but I downloaded about
0:00:25.980,0:00:31.620
22 Supreme Court opinions this was about
Anti-Trust law so a very specific domain
0:00:32.700,0:00:40.260
um and then I got them converted to text so that
is um I did that with another uh repo it's fine
0:00:40.260,0:00:47.100
but then what we did was the the big thing that we
achieved was that we figured out that we can get
0:00:47.100,0:00:54.960
gpt3 to just go ahead and write my um write the
uh Whatchamacallit for us the the knowledge graph
0:00:54.960,0:01:03.600
Json and so what I'm doing is the nodes that I
asked for was each node should be a case citation
0:01:03.600,0:01:08.520
precedent or prior opinion each node should
have several properties such as date case number
0:01:08.520,0:01:14.700
involved parties reasoning for including in this
opinion and other relevant information so that
0:01:15.360,0:01:19.260
um worked really well because you see like it's
got a case number it tells me when it happened
0:01:19.260,0:01:26.760
and then the reasoning so like involve parties
like great this is Phenom this is phenomenal
0:01:26.760,0:01:32.400
information so what we're doing is we're going
to create a cross-linked web as to like why
0:01:32.400,0:01:40.080
all these things are interlinked so that way
theoretically if a uh the the this hypothetical
0:01:40.080,0:01:46.560
use case if an attorney is researching antitrust
laws so that one they can you know go to a court
0:01:46.560,0:01:51.600
court of appeals or even present to the Supreme
Court they will have a masterful understanding
0:01:51.600,0:01:59.100
of established law and the reason that this is
important is because common law is um how law
0:01:59.100,0:02:05.040
Works in America it's by prior precedent is the
so you've got the the laws that are laid down by
0:02:05.040,0:02:10.920
the legislative branch and then interpreted by the
judicial branch and so the judicial branch keeps
0:02:10.920,0:02:15.660
track of their own interpretation right because
there's the legislative branches it it has to do
0:02:15.660,0:02:21.900
with separation of powers anyways so that's how
we got Where We Are um it worked really well and
0:02:21.900,0:02:26.760
um there's all kinds of what-ifs and gotchas that
I'm not going to worry about because this worked
0:02:26.760,0:02:32.820
really well and we can iterate over the over the
long term so but this I had to plug in manually
0:02:33.480,0:02:38.940
and it's about four pages so like if we do
if we search for how many new pages there are
0:02:39.780,0:02:44.580
one two three yeah one two three
so that this has four pages total
0:02:45.840,0:02:51.840
um which the that that was good so the
next problem is we got to take these and
0:02:51.840,0:02:57.300
regardless of how long they are we have to
break them down into chunks of four pages
0:02:57.300,0:03:04.860
maximum so rather than do this manually I was
like why don't I just ask chat GPT so here we go
0:03:06.480,0:03:21.300
um I have a folder named um uh let's see
um opinions uh text um this folder contains
0:03:22.380,0:03:38.880
um uh text files of scotus decisions um that
were converted from PDFs the pages are demarcated
0:03:39.480,0:03:50.160
by um by the words new page I need a python
function so I'm basically talking to this
0:03:50.160,0:03:56.880
thing like I'd be talking to a developer
I need the python function where I pass um
0:03:58.980,0:04:05.100
pass uh actually no I actually let's
just ask it to do the whole thing I
0:04:05.100,0:04:17.160
need a python function that um reads every
uh uh file in opinions underscore text um
0:04:19.980,0:04:29.640
and then breaks each file
into chunks of four pages um
0:04:30.840,0:04:45.120
the purpose of this is to um limit the size of
each subsequent file please then save the chunks
0:04:46.500,0:05:02.340
into um into a folder named chunks text
um and append um a serial number to
0:05:02.880,0:05:15.720
the original file name for instance um you know
star underscore one star underscore two and so on
0:05:17.700,0:05:18.720
all right let's see what it does
0:05:23.880,0:05:27.480
why would this violate the content policy
0:05:31.980,0:05:38.700
okay I sent in feedback um let's see
what we've got so import OS split text
0:05:38.700,0:05:43.140
files input folder output folder pages
per chunk so good it's parameterized it
0:05:44.340,0:05:52.380
um if not uh output folder make it nice
okay for file and Os Lister input folder
0:05:53.640,0:06:01.680
um with open OS path join input folder file r
as F read it excellent Pages equals split on
0:06:01.680,0:06:08.640
new page perfect so it understood that for I and
chunk enumerate page range zero pages per chunk
0:06:08.640,0:06:14.880
excellent so this is this is a list comprehension
that will break it into equal chunks or chunks
0:06:14.880,0:06:21.840
of four output file equals this underscore
plus oh dang that's good that's good okay
0:06:23.640,0:06:27.060
um this is wonderful so let's go to this
0:06:28.140,0:06:31.080
um I had started writing it and then I
was like I don't have the energy for this
0:06:32.100,0:06:39.360
I'm telling you I have said it on Twitter and
I've said it on mostly Twitter but LinkedIn
0:06:39.360,0:06:44.880
and a few other places English your ability
to describe what you want is going to be the
0:06:44.880,0:06:52.140
primary programming language from now on period
end of story if you can think through a function
0:06:53.340,0:07:02.100
then um yeah this is this is it so let's run
this let's see if this worked um need a command
0:07:02.100,0:07:09.660
prompt CD and we're in the scotus opinions um
and then we'll do python step 01 split chunks
0:07:11.700,0:07:17.460
why you no work um this actually happens
quite a bit let's see if it can fix it
0:07:20.640,0:07:29.460
um so thanks that mostly
worked but through this error
0:07:32.760,0:07:39.060
and you fixed that error please now if this
can debug this I know exactly what it is
0:07:40.800,0:07:44.040
yep yep there you go
0:07:53.700,0:07:57.600
you need to specify the encoding yep perfect
0:08:02.280,0:08:05.460
oh wow okay it's saying add the ignore flag
0:08:08.280,0:08:12.900
wonderful wonderful adding
encoding utf-8 should be sufficient
0:08:15.060,0:08:16.680
um so that goes to do
0:08:18.720,0:08:19.560
right here
0:08:25.920,0:08:28.200
hmm fascinating
0:08:31.860,0:08:33.240
did it get farther
0:08:35.640,0:08:40.980
this is weird though input folder
file contents read encoding utf-8
0:08:40.980,0:08:48.060
as infile so sometimes this happens um
let's see what is the encoding here so
0:08:48.060,0:08:55.020
sometimes what I do is I'll convert it
to something and then I'll convert back
0:08:57.660,0:08:58.260
so let's
0:09:02.400,0:09:04.200
yeah hmm
0:09:07.080,0:09:12.000
I wonder what happens if we do let's
add let's add that ignore thing rather
0:09:12.000,0:09:15.480
than rather than getting lost in the
weeds let's just do errors equal ignore
0:09:18.660,0:09:21.420
because you know what I looked at
the text file it looks fine to me
0:09:23.640,0:09:30.780
oh and let's also see if it um okay it got
pretty far already so it got pretty far in
0:09:30.780,0:09:37.200
the process before blowing up one two three
so now we've got all the chunks cool cool cool
0:09:39.180,0:09:43.980
um all right heck with it send it what do you mean
0:09:47.580,0:09:48.900
this was in
0:09:51.180,0:09:57.240
so this was in Split text files F right
oh it couldn't write it interesting
0:09:59.640,0:10:07.380
oh okay okay okay um so when we write it we
need to we need to convert it to ASCII and back
0:10:08.400,0:10:11.280
um so let's see if it understands this
0:10:13.740,0:10:17.100
oh man I'll be I'll be jazzed if
it if it understands this okay
0:10:17.100,0:10:25.140
um great that worked uh now I have
a new problem here's the error
0:10:27.300,0:10:31.740
um please find a solution for this new bug
0:10:34.440,0:10:37.440
so what so the problem here is and I
0:10:38.400,0:10:44.760
um yeah that you're trying to join a list of
strings that contain spaces I don't think so
0:10:53.700,0:10:55.740
new page nope that's not it
0:11:04.140,0:11:16.260
replace no no I don't think that's it I
think we need to fix the um string encoding
0:11:17.280,0:11:20.040
um here's the last part of the error
0:11:26.040,0:11:29.400
I feel like I'm talking to like Hal 9000.
0:11:32.520,0:11:34.500
yep there it is that's it
0:11:39.660,0:11:40.260
yep
0:11:43.020,0:11:47.760
ah so we need to add the encoding to
to utf-8 right okay that makes sense
0:11:49.080,0:11:50.700
this thing is smarter than me
0:11:54.000,0:11:57.720
um okay so when we write it yeah
oh that's the problem I okay
0:11:59.400,0:12:05.760
um encoding equals UTF eight okay cool
but uh this took way less brain power
0:12:06.780,0:12:12.480
okay let's see if that works DLS clear
screen that was fast all right so now
0:12:12.480,0:12:26.980
we have 144 chunks of text let's see what the
biggest one is 20 uh [Laughter] uh [Laughter]
0:12:31.740,0:12:35.040
I think the uh I think the
OCR messed up with this one
0:12:38.160,0:12:45.360
why do you believe that the Sherman Act
okay I'm probably disturbing my audience
0:12:47.100,0:12:50.700
what's funny is that it got the capital
letters fine but then the rest it's like
0:12:50.700,0:12:57.360
no I wonder if it was like if the scanner bed
was like moving too slowly or something foreign
0:13:01.380,0:13:06.480
I'm sorry okay I'm gonna pause this
because I'm gonna keep laughing
0:13:11.400,0:13:15.300
okay I had to stop and make myself tea
I'm still probably gonna be looking a
0:13:15.300,0:13:24.240
bit I'm sorry okay now normally oh sorry oh
I don't know why this is so funny normally
0:13:25.440,0:13:30.720
um this would uh what I would do is just
delete data but in this case I don't want to
0:13:30.720,0:13:36.420
lose anything and imagine like what if half of
our files were like this so let's come up with
0:13:36.420,0:13:43.740
a solution so I'm going to with so this this this
this function works so I'm going to say okay cool
0:13:44.340,0:13:54.480
let's make a new script so let's go back over to
Chad GPT great that all worked um but it turns
0:13:54.480,0:14:06.600
out the OCR barfed on a few files so the um so
there are lots of duplicated characters such as
0:14:09.360,0:14:22.320
um please write a new script using Spacey or
nltk to um de-duplicate uh extra characters
0:14:23.880,0:14:34.380
um search through all the files in um see
what was the folder name chunks.text in
0:14:35.880,0:14:48.360
chunks underscore text um deduplicate deduplicate
characters um and then save the cleaned up copy
0:14:49.860,0:14:55.080
um in a new folder called um dedupt text
0:15:05.640,0:15:06.420
all right
0:15:40.560,0:15:41.400
will that work
0:15:46.440,0:15:50.160
I don't know if that'll work is Spacey that good
0:15:54.060,0:15:55.320
or T and Doc
0:15:59.160,0:16:04.260
um okay I think I'll need to add the
encoding but let's give it a shot
0:16:06.720,0:16:10.500
and let me make sure I've
got Pippin pip install Spacey
0:16:12.480,0:16:19.920
okay so while that's running I will do
save and we'll come back here and we'll do
0:16:22.500,0:16:31.080
um Step oh two oh so you might notice I started
I I um I do my functions like in order so that
0:16:31.080,0:16:35.340
way like if you're looking at this repo you don't
have to like try and reverse engineer what order
0:16:35.340,0:16:43.200
to run things in like it's just intrinsic
documentation um dedupe characters dot pi
0:16:45.600,0:16:59.220
okay python step O2 and I know that it's going to
barf on this so let's do any coding E equals UTF
0:17:00.180,0:17:05.700
utf-9 utf-9 is clearly better than
utf-8 right it's the next mission
0:17:07.440,0:17:13.140
sorry I'm in a mood today apparently
you having a giggle Mite okay um
0:17:15.420,0:17:16.380
let's see if it works
0:17:20.400,0:17:24.540
I'm not going to use the GPU for
something this simple it's a small
0:17:25.260,0:17:31.080
skipping registered can't find and it
doesn't seem to be a python package huh
0:17:37.500,0:17:43.620
let's see um it gave me this error
0:17:46.740,0:17:48.000
what does it mean
0:17:50.400,0:17:51.660
it didn't work
0:17:58.140,0:18:00.600
and this is the value of
having good error messages kids
0:18:17.400,0:18:19.800
interesting all right
0:18:31.140,0:18:35.220
let's see if that worked still didn't work
0:18:40.080,0:18:48.780
um okay all right so in this case it seems like we
found a limitation still didn't work Spacey load
0:18:54.960,0:18:56.940
says requirement already satisfied
0:18:59.040,0:19:02.220
yeah it says it's already satisfied um
0:19:04.320,0:19:09.180
let's go let's go to Google
let's see if uh let's see if this
0:19:18.540,0:19:19.380
all right
0:19:26.280,0:19:26.880
import
0:19:35.040,0:19:37.500
so it looks like you need
to do you need to install
0:19:49.740,0:19:50.640
let's try this
0:19:56.700,0:19:58.080
that looks like it did a thing
0:20:05.880,0:20:09.840
sorry I wasn't talking through it so
basically I was looking I found I found
0:20:09.840,0:20:16.800
a Spacey issue on GitHub and um let's see it
hasn't bombed yet so I wonder if it's working
0:20:18.780,0:20:26.220
um dedupt cool there's nothing
that's 20 uh no it didn't work
0:20:28.620,0:20:29.940
still didn't work okay
0:20:33.060,0:20:33.560
um
0:20:39.240,0:20:43.920
yeah so seems like Spacey is not
sufficient or at least this isn't
0:20:45.060,0:20:56.160
um okay so let's go back here go back to this um
okay I got Spacey to load but the original script
0:20:56.700,0:21:07.860
does not work the words um in some files are
still wrong um with lots of duplicated characters
0:21:10.080,0:21:11.100
um so for instance
0:21:16.680,0:21:17.580
for example
0:21:23.460,0:21:25.320
I probably should ask another way
0:21:27.360,0:21:27.860
yep
0:21:30.540,0:21:33.120
yep okay so it's actually going to tell me
0:21:33.720,0:21:39.360
yeah use regex so that's regex is actually the
way I would have done it originally so it looks
0:21:39.360,0:21:44.100
like this might be the right way to do it it
so all right let's see what it says foreign
0:21:47.760,0:21:50.580
do the same thing okay
0:21:58.320,0:21:59.220
interesting
0:22:01.740,0:22:13.980
okay um let's give this a try I will give
that a try but what about uh words where
0:22:14.760,0:22:25.740
um there are supposed to be duplicated
characters like book or look
0:22:27.960,0:22:35.220
um does yours script handle that and we use nltk
0:22:36.360,0:22:43.380
or Spacey or something else
to account for correct words
0:22:51.480,0:22:55.800
also while this is running I'm going to address
the elephant in the room and that was it like a
0:22:55.800,0:23:02.460
week ago I made a video saying like meh chat GPT
isn't that great um and uh so obviously like I
0:23:02.460,0:23:09.480
kind of have my foot in my mouth because this
is amazing it also looks like it failed um so
0:23:10.800,0:23:16.080
all right let me do a time check we are at
23 minutes so we're almost done for the day
0:23:16.920,0:23:18.540
um oh there we go okay
0:23:30.660,0:23:39.360
all right uh it's yeah yeah it'll it's gonna
give me that thing so I think what we'll do is
0:23:40.380,0:23:46.440
we'll just check to see if this happens
because it's better to have some data than none
0:23:46.980,0:23:53.460
so uh I think what we'll do all right split
chunks that one worked dedupe characters so
0:23:53.460,0:24:02.640
what we need to do is find the ones where this
happened right so um generally it looks like
0:24:08.160,0:24:09.120
okay
0:24:13.980,0:24:18.060
looks like it only happened
in one document yeah okay so
0:24:19.860,0:24:29.760
in this case we'll say we'll pass in any file
that um starts with this so we'll we'll uh we'll
0:24:29.760,0:24:35.940
we'll eat the eat the cost and what I mean by
that is we'll just Excel accept that some data
0:24:35.940,0:24:40.380
is not going to be quite as good but that's
plausible because like it's an OCR mistake
0:24:40.380,0:24:51.900
but the thing is is we just need it to be small
enough to fit in um so de-duplicate characters uh
0:25:01.740,0:25:03.600
yeah I don't I don't think this is gonna work
0:25:06.240,0:25:14.760
so it it's close but it's not quite um
thanks um can you modify the uh the uh
0:25:15.900,0:25:27.480
the script that uses regex instead um the
only problem is in files with and the name
0:25:29.520,0:25:35.400
can you just sort um or filter
filter out any other files
0:25:39.060,0:25:44.580
okay so let's see what it comes up with there
meanwhile it looks like this is the largest
0:25:44.580,0:25:52.500
one that is um not good right so it's 13 000
characters long so let's see how many tokens
0:25:52.500,0:26:01.140
this is so as long as it's that's getting pretty
close um let's see if we can if if our uh prompt
0:26:03.120,0:26:05.760
will work so prompt
0:26:15.180,0:26:21.840
and then we'll do text DaVinci 03 temperature
zero and our maximum length is going to be
0:26:21.840,0:26:28.800
like 600 tokens otherwise it's going to be too
long so let's see if let's see if that's enough
0:26:29.820,0:26:31.980
the output is pretty short so
0:26:34.020,0:26:38.880
but Json is pretty token intensive so I
wouldn't be surprised if it yeah so we
0:26:38.880,0:26:43.080
ran out of space so it looks like we
need to go back and do smaller chunks
0:26:44.340,0:26:49.680
um but fortunately these earlier scripts
are really easy so for instance we just come
0:26:49.680,0:26:56.460
back here and instead of doing um uh we we just
update this so instead of four pages we do three
0:26:57.420,0:27:03.960
and so then what we'll do is we'll come in here
to opinions is fine chunks let's delete these
0:27:05.220,0:27:13.980
and then we'll rerun um python step 01 and
so now what we should have is more chunks
0:27:15.420,0:27:19.140
so now we have 188 chunks but you see the the um
0:27:21.240,0:27:23.100
that one that one
0:27:24.720,0:27:30.900
so this is this is now the largest and it's
only 10 kilobytes or nine nine thousand
0:27:31.620,0:27:37.500
um characters long so this should be small
enough to run let's see how many tokens it is
0:27:43.380,0:27:47.400
um okay so that's now we're down to
2800 tokens let me remove the Json part
0:27:48.420,0:27:53.820
so now we're at 2500 tokens
tops so we can do we can do
0:27:54.420,0:28:00.960
um we can do like 1450 so that's more than
twice as many tokens so let's see it's it's
0:28:00.960,0:28:08.040
more than twice as many tokens with uh 25 less
input so theoretically this should be the limit
0:28:10.860,0:28:16.020
um cool all right excellent I think this is good
0:28:18.360,0:28:24.480
all right so we're almost done for the
day um let's see what chat GPT said
0:28:25.200,0:28:30.780
do you duplicate characters yeah there we
go okay so let's do this for the D dupe
0:28:33.180,0:28:41.580
um so let's come back here save it and
then we'll do python step O2 dedupe
0:28:42.600,0:28:44.640
all right we needed to add the um
0:28:45.960,0:28:57.720
Whatchamacallit encoding encoding equals UTF eight
and then for the right and coding equals UTF eight
0:29:00.120,0:29:05.700
okay so now if we go to the dedupt oh
actually here we need to delete this
0:29:09.300,0:29:12.420
it did all of them interesting um
0:29:18.900,0:29:22.080
oh it that's no if if not in file
0:29:25.260,0:29:27.180
it got the logic wrong I'm like no it was
0:29:27.180,0:29:29.820
only supposed to do like five
of these let's try that again
0:29:32.760,0:29:36.780
there we go okay so now the largest
of these is eight kilobytes which is
0:29:36.780,0:29:43.500
plenty small so we come back to chunks and
we'll replace those replace yes so now our
0:29:43.500,0:29:50.220
absolute largest file is 10 kilobytes 9 300
characters so this should all easily fit within
0:29:51.360,0:30:00.840
um no this should all easily fit within our prompt
window here so now we need to just go ahead and um
0:30:03.060,0:30:06.540
run the thing so all right so let's say
0:30:09.540,0:30:22.080
excellent we are now ready to run our uh prompts
are you familiar with open AI python module
0:30:29.280,0:30:34.380
open AI is dot dot dot I'm wondering
if it's slowing down because
0:30:34.920,0:30:40.800
um uh people are waking up this is one advantage
of being a super early bird is I get up earlier
0:30:40.800,0:30:44.940
than like everyone else granted you know it's
always middle of the day somewhere in the world
0:30:46.500,0:30:57.540
um anyways uh so this may or may not help us with
um with this part so now let's see open file save
0:30:57.540,0:31:09.240
file open AI key text DaVinci 03 tempt that that
we can do this up to 1450 that's fine in code
0:31:11.340,0:31:19.680
all right so this I just cannibalized another um
thing um we do not want to clean that up because
0:31:19.680,0:31:26.700
this is going to be there that's fine all right
network error yeah I figured that would happen
0:31:27.600,0:31:36.180
um okay so let's do a refresh um I have
a folder called what is the name of my
0:31:36.180,0:31:44.700
folder God I have like chipmunk chip monk memory
it's hard chipmunk memory um called chunks.text
0:31:47.640,0:31:53.880
called chunks underscore text um
that is full of dot text files
0:31:54.840,0:32:04.620
um each file um please write a python
script that uses open AI module and
0:32:05.520,0:32:19.020
new module and calls um and uses text DaVinci
O3 as the engine we'll say as the model um
0:32:21.420,0:32:33.660
and uses the contents of the file
in the chunks folder to populate
0:32:34.920,0:32:47.820
a prompt The Prompt is stored as um let's
see prompt underscore Json LD uh citation
0:32:47.820,0:32:56.760
nodes dot text and has a placeholder called
um what did I call the placeholder chunk
0:32:59.580,0:33:03.120
called chunk um in other words
0:33:05.940,0:33:14.520
open the text file or no um let's
see open the prompt file then replace
0:33:17.880,0:33:27.840
chunk with the contents of the file from
chunks.text and use this as the prompt for
0:33:28.920,0:33:39.900
um openai set the temperature to zero and the
token count to 1450. let's see if that works
0:33:42.300,0:33:44.760
certainly here's a script that
should do what you have described
0:33:48.240,0:33:52.680
I do it differently because I use
Windows I know I'm a charlatan um
0:34:13.500,0:34:17.040
I think it's going to work because
0:34:23.760,0:34:33.120
so this is the the implications of this if
chat GPT knows how to call GPT in theory you
0:34:33.120,0:34:36.960
can create a machine that can do its
own experiments with language models
0:34:38.940,0:34:45.660
I want to say that again the implication here is
now we have a system of a machine that understands
0:34:45.660,0:34:51.600
the code to call itself well enough and then
obviously this thing is intelligent enough to
0:34:51.600,0:34:57.960
understand results you could in theory have
something that trains itself or makes its
0:34:57.960,0:35:09.180
own data sets or whatever um okay so I like this
whoops copy code let's come over here and do this
0:35:10.440,0:35:15.300
um I guess I didn't say what
to do with the response um
0:35:17.460,0:35:20.100
so we'll do let's see
0:35:24.720,0:35:30.660
um [Music] and then yeah we
don't need any of this stuff
0:35:35.700,0:35:44.880
so that's fine um and then we need to
rather than print um we'll need to save
0:35:47.820,0:35:58.860
great um the output from the the uh
let's see great now response dot text
0:35:59.580,0:36:05.880
should be in Json format can you
please save the output to a new folder
0:36:06.480,0:36:14.520
called um uh let's see we'll
say called AG underscore Json um
0:36:17.880,0:36:22.440
instead of just printing um we should
0:36:24.180,0:36:36.360
uh let's see otherwise uh say use the same file
name as the dot text file but just replace the dot
0:36:36.360,0:36:49.800
text with DOT Json um let's see before you write
this script can you uh tell me if you understand
0:36:51.840,0:37:05.700
um give me some pseudocode so I can check to make
sure we understand each other if this works yeah
0:37:15.420,0:37:17.460
well it understands the concept of pseudocode
0:37:19.560,0:37:23.280
and then it's going to go ahead and write the
script Okay cool so let's see if it fixes it
0:37:24.420,0:37:30.000
um asking for the pseudocode may or may not
be viable especially because this code is
0:37:30.000,0:37:36.000
so simple it might have been easier just to
ask for it to um to just go ahead and do it
0:37:37.080,0:37:49.680
um but yeah so we're also going to need to do
um in coding equals utf-8 always need to do that
0:37:55.260,0:37:56.580
here I'll just copy this
0:38:01.320,0:38:02.700