Researchers! On 19 December 2024, a preprint was published that focuses on "evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation." The 4DS-j model presented there achieves significantly better monocular depth estimation results than DINOv2 ViT-g, making it a better backbone than DINOv2 for specialised video depth estimation models, which in turn can serve as the basis for better 2D-to-3D video conversion! Please consider using the 4DS-j backbone instead of DINOv2 ViT-g in your future breakthrough video depth estimation models! Below is a special ranking showing the capabilities of 4DS-j:
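For anyone who wants to try such a backbone swap, here is a minimal sketch of a frozen backbone under a small per-patch depth head (assuming PyTorch; DINOv2 ViT-g is loaded via its public torch.hub entry point, while the 4DS-j branch is only a hypothetical placeholder, since no loader is referenced here):

```python
# Minimal sketch: frozen feature backbone + tiny per-patch depth head.
# Assumptions: PyTorch; DINOv2 ViT-g loaded via its public torch.hub entry point.
# The 4DS-j branch is a hypothetical placeholder - swap in the official loader
# once the authors release code/checkpoints.
import torch
import torch.nn as nn

def load_backbone(name: str) -> nn.Module:
    if name == "dinov2_vitg14":
        return torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")
    if name == "4ds_j":
        raise NotImplementedError("Plug in the official 4DS-j loader here.")
    raise ValueError(f"Unknown backbone: {name}")

class DepthProbe(nn.Module):
    """Frozen backbone with a per-patch linear depth head (a simple probe, not DPT)."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 1536):  # 1536 = DINOv2 ViT-g width
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False  # only the head is trained
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # DINOv2's forward_features() exposes patch tokens under "x_norm_patchtokens";
        # a video backbone such as 4DS-j will expose features differently,
        # so adapt this call accordingly.
        tokens = self.backbone.forward_features(frames)["x_norm_patchtokens"]
        return self.head(tokens)  # (batch, num_patches, 1); reshape/upsample to a depth map
```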
Due to the recent influx of new models, which I am unable to add to the rankings immediately, I have decided to add a waiting list of new models:
- Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.078
- NYU-Depth V2: AbsRel<=0.045 (relative depth)
- NYU-Depth V2: AbsRel<=0.051 (metric depth)
- NYU-Depth V2 (640×480): AbsRel<=0.058 (old layout - currently no longer up to date)
- DA-2K (mostly 1500×2000): Acc (%)>=86 (old layout - currently no longer up to date)
- UnrealStereo4K (3840×2160): AbsRel<=0.04 (old layout - currently no longer up to date)
- Middlebury2021 (1920×1080): SqRel<=0.5 (old layout - currently no longer up to date)
- Appendix 1: Rules for qualifying models for the rankings (to do)
- Appendix 2: Metrics selection for the rankings (to do)
- Appendix 3: List of all research papers from the above rankings
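For reference, the AbsRel and SqRel thresholds above use the standard relative-error definitions. A minimal sketch (assuming NumPy arrays of predicted and ground-truth depth restricted to valid pixels; this is the common textbook definition, not code from any of the ranked papers):

```python
# Minimal sketch of the standard AbsRel / SqRel depth error metrics.
# Assumes `pred` and `gt` are NumPy arrays of predicted and ground-truth depth
# restricted to valid pixels (gt > 0); not taken from any ranked paper's code.
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return float(np.mean(np.abs(pred - gt) / gt))

def sq_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean squared relative error: mean((pred - gt)^2 / gt)."""
    return float(np.mean((pred - gt) ** 2 / gt))

# Example: a model would qualify for the NYU-Depth V2 metric-depth ranking above
# if abs_rel(pred, gt) <= 0.051 on that benchmark.
```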
📝 Note: There are no quantitative comparison results for StereoCrafter yet, so this ranking is based on my own perceptual judgement of the qualitative comparison results shown in Figure 7. One output frame (right view) is compared with one input frame (left view) from the video clip 22_dogskateboarder, and one output frame (right view) is compared with one input frame (left view) from the video clip scooter-black.
RK | Model Links: Venue Repository | Rank ↓ (human perceptual judgment) |
---|---|---|
1 | StereoCrafter | 1 |
2-3 | Immersity AI | 2-3 |
2-3 | Owl3D | 2-3 |
4 | Deep3D | 4 |
RK | Model Links: Venue Repository | TAE ↓ {Input fr.} DAV |
---|---|---|
1 | Depth Any Video | 2.1 {MF} |
2 | DepthCrafter | 2.2 {MF} |
3 | ChronoDepth | 2.3 {MF} |
4 | NVDS | 3.7 {4} |
RK | Model Links: Venue Repository | OPW ↓ {Input fr.} FD | OPW ↓ {Input fr.} NVDS+ | OPW ↓ {Input fr.} NVDS |
---|---|---|---|---|
1 | FutureDepth | 0.303 {4} | - | - |
2 | NVDS+ | - | 0.339 {4} | - |
3 | NVDS | 0.364 {4} | - | 0.364 {4} |
📝 Note: This ranking is based on data from Table 4. The example result 3:0:2 (first left in the first row) means that Depth Pro has a better F-score than UniDepth-V on 3 datasets, the same F-score as UniDepth-V on no dataset, and a worse F-score than UniDepth-V on 2 datasets.
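A minimal sketch of how such a wins:ties:losses record can be derived from per-dataset F-scores (hypothetical helper, assuming one F-score per dataset for each of the two models being compared, with higher F-score being better):

```python
# Minimal sketch: derive a "wins:ties:losses" record (e.g. 3:0:2) from
# per-dataset F-scores of two models. Hypothetical helper, not code from the paper.
def head_to_head(f_scores_a: dict, f_scores_b: dict) -> str:
    """Compare model A vs model B dataset by dataset (higher F-score is better)."""
    wins = ties = losses = 0
    for dataset in f_scores_a.keys() & f_scores_b.keys():
        a, b = f_scores_a[dataset], f_scores_b[dataset]
        if a > b:
            wins += 1
        elif a == b:
            ties += 1
        else:
            losses += 1
    return f"{wins}:{ties}:{losses}"

# e.g. head_to_head(depth_pro_scores, unidepth_v_scores) -> "3:0:2" means
# Depth Pro beats UniDepth-V on 3 datasets and loses on 2.
```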
RK | Model Links: Venue Repository | AbsRel ↓ {Input fr.} MonST3R | AbsRel ↓ {Input fr.} DC |
---|---|---|---|
1 | MonST3R | 0.063 {MF} | - |
2 | DepthCrafter | 0.075 {MF} | 0.075 {MF} |
3 | Depth Anything | - | 0.078 {1} |
RK | Model Links: Venue Repository | AbsRel ↓ {Input fr.} M3D v2 | AbsRel ↓ {Input fr.} GRIN |
---|---|---|---|
1 | Metric3D v2 ViT-giant | 0.045 {1} | - |
2 | GRIN_FT_NI | - | 0.051 {1} |
RK | Model | AbsRel ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
---|---|---|---|---|---|---|
1 | ZoeDepth +PFR=128 ENH: | 0.0388 {1} | ENH: UnrealStereo4K | ENH: | - | - |
RK | Model | SqRel ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
---|---|---|---|---|---|---|
1 | LeReS-GBDMF ENH: | 0.444 {1} | ENH: HR-WSI | ENH: | - | - |